Fix Windows E2E hang: bound daemon requests and reap leaked daemon between sessions#4041
Open
gavande1 wants to merge 3 commits into
…tween sessions

Every session's quit-time 'site stop --all' hangs past its 20s timeout on Windows CI, leaving that session's site servers running in the machine-global process-manager daemon. Playground sites weigh 6 capacity units, so six leaked sites exhaust the 36-unit cap; every later createSite then fails by timeout, stretching the suite past the 180-minute job limit, and leaked php.exe processes block session cleanup and runner exit.

- SocketRequestClient now times out waiting for a response, so a wedged daemon can no longer hang CLI commands forever
- The daemon's stopProcess settles even if a child never reports exit, so kill-daemon always completes and capacity is freed
- E2E cleanup reaps any surviving daemon tree between sessions and no longer aborts when the app failed to launch
- The quit-time stop logs CLI progress events for future diagnosis
- Re-enable Windows E2E in CI to verify (AINFRA-2588)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
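The first two bullets describe the same idea applied twice: a wait that used to be unbounded now has a deadline. The sketch below illustrates that pattern only; it is not the PR's code. `withResponseTimeout`, `settleWithin`, and `ResponseTimeoutError` are hypothetical names, and the real `SocketRequestClient` and `stopProcess` may be structured differently.

```typescript
// Hypothetical error type; the real client may report timeouts differently.
class ResponseTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`No response within ${timeoutMs}ms`);
    this.name = "ResponseTimeoutError";
  }
}

// Request side: reject if `pending` has not settled within `timeoutMs`,
// so a wedged daemon cannot hang the caller forever.
function withResponseTimeout<T>(pending: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new ResponseTimeoutError(timeoutMs)), timeoutMs);
    pending.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Stop side: resolve when the exit notification arrives, or after `graceMs`
// if it never does. The result says which path was taken.
function settleWithin(exited: Promise<void>, graceMs: number): Promise<"exited" | "forced"> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve("forced"), graceMs);
    const onExit = () => {
      clearTimeout(timer);
      resolve("exited");
    };
    exited.then(onExit, onExit);
  });
}
```

The request side rejects so the CLI command can fail visibly, while the stop side resolves so that cleanup can continue and capacity can be released either way.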
📊 Performance Test Results
Comparing 78deb0b vs trunk: app-size, site-editor, site-startup
Results are median values from multiple test runs. Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff) |
On Windows the `site stop --all` CLI stops the sites and reports success within a few hundred ms, but its process can linger without self-exiting, so stopAllServers waited out the full quit timeout (20s in E2E) and force-killed it on every session, adding ~20s per session to the suite. Act on the CLI's reported completion event and reap the process then, instead of waiting for it to exit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
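A minimal sketch of the "act on completion, then reap" idea from this commit, under stated assumptions: the `CliChild` interface, the `"completed"` event type, and `stopAndReap` are all hypothetical stand-ins, since the actual event names and child-process handle are not shown here.

```typescript
// Hypothetical handle for the spawned CLI process.
interface CliChild {
  onEvent(listener: (event: { type: string }) => void): void;
  onExit(listener: () => void): void;
  kill(): void;
}

type StopOutcome = "completed" | "exited" | "timed-out";

// Settle as soon as the CLI reports completion instead of waiting for the
// process to exit on its own; the timeout remains only as a backstop.
function stopAndReap(child: CliChild, timeoutMs: number): Promise<StopOutcome> {
  return new Promise((resolve) => {
    let settled = false;
    const finish = (outcome: StopOutcome, reap: boolean) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      if (reap) child.kill(); // the work is done or overdue; do not wait for self-exit
      resolve(outcome);
    };
    const timer = setTimeout(() => finish("timed-out", true), timeoutMs);
    child.onEvent((event) => {
      // "completed" is an assumed event type used for illustration.
      if (event.type === "completed") finish("completed", true);
    });
    child.onExit(() => finish("exited", false));
  });
}
```

With this shape, a CLI that reports success within a few hundred milliseconds no longer costs the full quit timeout, even if its process lingers.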
Related issues
How AI was used in this PR
Claude analyzed the Buildkite log from a timed-out Windows E2E run (build #18530), traced the 2.1-hour suite duration to a leak chain in the machine-global process-manager daemon, identified the unbounded socket waits that allow the quit-time `site stop --all` to hang, implemented the fix across the CLI daemon/socket layers and the E2E harness, and added unit tests. All findings were verified against the code; the exact hanging line on Windows still needs a live CI run to confirm, which is why this PR re-enables the Windows E2E job.
Proposed Changes
Windows E2E jobs have been timing out at the 180-minute job limit, blocking merges into trunk because the limited Windows CI workers stay occupied for the full duration.
The cause is a leak chain. On Windows CI, each test session's quit-time "stop all sites" command hangs and gets force-killed, so that session's WordPress servers keep running inside the process-manager daemon, which is machine-global (a fixed named pipe) and shared across every session. Playground sites each consume 6 of the daemon's 36 capacity units, so after only six leaked sites the cap is exhausted and every subsequent site creation fails by timeout, stretching the suite far past the job limit. Leaked `php.exe` processes also hold DLL locks that break session cleanup and delay runner exit.
This PR breaks the chain at several levels so it holds even if one product-side hang remains:
Note: some Windows E2E tests may still fail for an unrelated reason: the log shows a separate Windows-only crash during site creation (`Assigning port…` step) that predates the capacity cascade. That needs its own follow-up; this PR's goal is to stop the hang so the suite finishes in a normal amount of time and that failure becomes visible.
Testing Instructions
- `E2E Tests on windows-x64` job should reach a terminal state well under the 180-minute timeout, with no `CAPACITY_LIMIT_REACHED (36/36)` errors in the log.
- `npm test -- apps/cli/tests/` (817 tests pass, including new daemon force-settle and socket response-timeout tests).
Pre-merge Checklist
🤖 Generated with Claude Code