Fix Windows E2E hang: bound daemon requests and reap leaked daemon between sessions by gavande1 · Pull Request #4041 · Automattic/studio

gavande1 · 2026-07-02T06:57:14Z

Related issues

Related to AINFRA-2588 (Investigate Studio Windows E2E hangs in Buildkite)

How AI was used in this PR

Claude analyzed the Buildkite log from a timed-out Windows E2E run (build #18530), traced the 2.1-hour suite duration to a leak chain in the machine-global process-manager daemon, identified the unbounded socket waits that allow the quit-time site stop --all to hang, implemented the fix across the CLI daemon/socket layers and the E2E harness, and added unit tests. All findings were verified against the code; the exact hanging line on Windows still needs a live CI run to confirm, which is why this PR re-enables the Windows E2E job.

Proposed Changes

Windows E2E jobs have been timing out at the 180-minute job limit, blocking merges into trunk because the limited Windows CI workers stay occupied for the full duration.

The cause is a leak chain. On Windows CI, each test session's quit-time "stop all sites" command hangs and gets force-killed, so that session's WordPress servers keep running inside the process-manager daemon. The daemon is machine-global (it listens on a fixed named pipe), so every session shares it. Playground sites each consume 6 of the daemon's 36 capacity units, so after only six leaked sites the cap is exhausted and every subsequent site creation fails by timeout, stretching the suite far past the job limit. Leaked php.exe processes also hold DLL locks that break session cleanup and delay runner exit.
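The capacity arithmetic behind that cascade can be checked with a small sketch. The constant and function names below are hypothetical and are not taken from the Studio codebase; only the numbers 36 and 6 come from the description above.

```typescript
// Numbers from the PR description; the names are illustrative assumptions.
const DAEMON_CAPACITY_UNITS = 36;
const PLAYGROUND_SITE_WEIGHT = 6;

// How many sites of a given weight fit before the cap is reached.
function sitesUntilExhausted(capacity: number, weight: number): number {
  return Math.floor(capacity / weight);
}

// Units still available after `leaked` sites were never stopped.
function remainingCapacity(capacity: number, weight: number, leaked: number): number {
  return Math.max(0, capacity - weight * leaked);
}

// A new site can only start if its full weight still fits.
function canStartSite(capacity: number, weight: number, leaked: number): boolean {
  return remainingCapacity(capacity, weight, leaked) >= weight;
}
```

With these values, five leaked sites still leave room for one more, while the sixth leak leaves zero units, which matches the point at which later site creations start failing.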

This PR breaks the chain at several levels so it holds even if one product-side hang remains:

- Daemon requests can no longer hang forever waiting on a wedged daemon that accepts a connection but never replies.
- The daemon's shutdown always completes and frees capacity, even when a child process cannot be killed.
- The E2E harness reaps any surviving daemon between sessions, so leaked sites can't accumulate across the suite, and cleanup no longer aborts when the app failed to launch.
- The quit-time stop now logs the CLI's progress events, so any future hang shows exactly how far it got.
- Windows E2E is re-enabled in CI to verify the fix.
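The first item, bounding a request that may never get a reply, can be sketched roughly as follows. This is a generic illustration and not the PR's actual SocketRequestClient code: `withResponseTimeout`, `ResponseTimeoutError`, and the `onTimeout` hook are hypothetical names introduced here.

```typescript
// Hypothetical error type used to tell a timeout apart from a daemon error.
class ResponseTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`No response from daemon within ${timeoutMs} ms`);
    this.name = "ResponseTimeoutError";
  }
}

// Rejects if `pending` has not settled within `timeoutMs`.
// `onTimeout` lets the caller release resources, e.g. destroy the socket.
function withResponseTimeout<T>(
  pending: Promise<T>,
  timeoutMs: number,
  onTimeout?: () => void,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      if (onTimeout) onTimeout();
      reject(new ResponseTimeoutError(timeoutMs));
    }, timeoutMs);
    pending.then(
      (value) => {
        clearTimeout(timer); // reply arrived in time
        resolve(value);
      },
      (error) => {
        clearTimeout(timer); // request failed on its own
        reject(error);
      },
    );
  });
}
```

A caller would wrap the promise that resolves on the daemon's reply, so that a daemon which accepts the connection but never answers produces a rejection instead of an indefinite wait.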

Note: some Windows E2E tests may still fail for an unrelated reason — the log shows a separate Windows-only crash during site creation (Assigning port… step) that predates the capacity cascade. That needs its own follow-up; this PR's goal is to stop the hang so the suite finishes in a normal amount of time and that failure becomes visible.

Testing Instructions

- CI: The E2E Tests on windows-x64 job should reach a terminal state well under the 180-minute timeout, with no CAPACITY_LIMIT_REACHED (36/36) errors in the log.
- Unit: npm test -- apps/cli/tests/ (817 tests pass, including new daemon force-settle and socket response-timeout tests).

Pre-merge Checklist

Have you checked for TypeScript, React or other console errors?

🤖 Generated with Claude Code

…tween sessions

Every session's quit-time 'site stop --all' hangs past its 20s timeout on Windows CI, leaving that session's site servers running in the machine-global process-manager daemon. Playground sites weigh 6 capacity units, so six leaked sites exhaust the 36-unit cap; every later createSite then fails by timeout, stretching the suite past the 180-minute job limit, and leaked php.exe processes block session cleanup and runner exit.

- SocketRequestClient now times out waiting for a response, so a wedged daemon can no longer hang CLI commands forever
- The daemon's stopProcess settles even if a child never reports exit, so kill-daemon always completes and capacity is freed
- E2E cleanup reaps any surviving daemon tree between sessions and no longer aborts when the app failed to launch
- The quit-time stop logs CLI progress events for future diagnosis
- Re-enable Windows E2E in CI to verify (AINFRA-2588)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
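The "settles even if a child never reports exit" behaviour in this commit can be illustrated with a minimal sketch. It assumes a simplified child interface and a fixed deadline; `ManagedChild`, `StopOutcome`, and `stopWithDeadline` are hypothetical names, and the real stopProcess may differ.

```typescript
// Hypothetical minimal view of a child process managed by the daemon.
interface ManagedChild {
  kill(): void;
  once(event: "exit", listener: () => void): void;
}

type StopOutcome = "exited" | "force-settled";

// Asks the child to stop and resolves when it exits, or when the deadline
// passes, whichever happens first. The promise never stays pending.
function stopWithDeadline(child: ManagedChild, deadlineMs: number): Promise<StopOutcome> {
  return new Promise<StopOutcome>((resolve) => {
    let settled = false;
    const settle = (outcome: StopOutcome) => {
      if (settled) return; // ignore a late exit after a forced settle
      settled = true;
      clearTimeout(timer);
      resolve(outcome);
    };
    const timer = setTimeout(() => settle("force-settled"), deadlineMs);
    child.once("exit", () => settle("exited"));
    child.kill();
  });
}
```

Because the caller always gets an outcome, the bookkeeping that frees capacity can run in both cases, including the case where the child could not be killed.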

wpmobilebot · 2026-07-02T07:16:13Z

📊 Performance Test Results

Comparing 78deb0b vs trunk

app-size

Metric	trunk	`78deb0b`	Diff	Change
App Size (Mac)	1317.24 MB	1317.24 MB	+0.00 MB	⚪ 0.0%

site-editor

Metric	trunk	`78deb0b`	Diff	Change
load	1075 ms	1075 ms	0 ms	⚪ 0.0%

site-startup

Metric	trunk	`78deb0b`	Diff	Change
siteCreation	6526 ms	6504 ms	22 ms	⚪ 0.0%
siteStartup	1859 ms	1856 ms	3 ms	⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

On Windows the `site stop --all` CLI stops the sites and reports success within a few hundred ms, but its process can linger without self-exiting, so stopAllServers waited out the full quit timeout (20s in E2E) and force-killed it on every session, adding ~20s per session to the suite. Act on the CLI's reported completion event and reap the process then, instead of waiting for it to exit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
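One way to picture "act on the reported completion event and reap the process" is the sketch below. The `CliProcess` interface, the `"completed"` message type, and `stopAllAndReap` are assumptions made for illustration; the actual event names and the shape of stopAllServers are not shown in this PR page.

```typescript
// Hypothetical minimal view of the spawned CLI process.
interface CliProcess {
  onMessage(listener: (message: { type: string }) => void): void;
  onExit(listener: () => void): void;
  kill(): void;
}

type StopResult = "completed" | "exited" | "timed-out";

// Resolves on the first of: reported completion, process exit, or timeout.
// On completion or timeout the process is killed rather than awaited.
function stopAllAndReap(cli: CliProcess, quitTimeoutMs: number): Promise<StopResult> {
  return new Promise<StopResult>((resolve) => {
    let done = false;
    const finish = (result: StopResult, reap: boolean) => {
      if (done) return;
      done = true;
      clearTimeout(timer);
      if (reap) cli.kill(); // the process may linger, so do not wait for exit
      resolve(result);
    };
    const timer = setTimeout(() => finish("timed-out", true), quitTimeoutMs);
    cli.onExit(() => finish("exited", false));
    cli.onMessage((message) => {
      if (message.type === "completed") finish("completed", true);
    });
  });
}
```

Under this design the quit path costs roughly as long as the CLI takes to report success, and the timeout only applies when no completion event arrives at all.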

gavande1 requested a review from a team as a code owner July 2, 2026 06:57

github-actions Bot assigned gavande1 Jul 2, 2026

gcsecsey mentioned this pull request Jul 2, 2026

Isolate the process-manager daemon per home on Windows #4061

Draft

1 task

gavande1 mentioned this pull request Jul 3, 2026

Fix Windows E2E test failures: bind native PHP site proxy to 127.0.0.1 #4067

Open

1 task

Merge branch 'trunk' into fix-windows-e2e-daemon-leak

78deb0b

gavande1 mentioned this pull request Jul 3, 2026

Fix Windows E2E hangs: restore per-command CLI shutdown and isolate the daemon per home #4075

Open

1 task

gavande1 wants to merge 3 commits into trunk from fix-windows-e2e-daemon-leak

3 participants
