
Fix Windows E2E hang: bound daemon requests and reap leaked daemon between sessions #4041

Open
gavande1 wants to merge 3 commits into trunk from fix-windows-e2e-daemon-leak

gavande1 (Contributor) commented Jul 2, 2026

Related issues

  • Related to AINFRA-2588 (Investigate Studio Windows E2E hangs in Buildkite)

How AI was used in this PR

Claude analyzed the Buildkite log from a timed-out Windows E2E run (build #18530), traced the 2.1-hour suite duration to a leak chain in the machine-global process-manager daemon, identified the unbounded socket waits that allow the quit-time site stop --all to hang, implemented the fix across the CLI daemon/socket layers and the E2E harness, and added unit tests. All findings were verified against the code; the exact hanging line on Windows still needs a live CI run to confirm, which is why this PR re-enables the Windows E2E job.

Proposed Changes

Windows E2E jobs have been timing out at the 180-minute job limit, blocking merges into trunk because the limited Windows CI workers stay occupied for the full duration.

The cause is a leak chain. On Windows CI, each test session's quit-time "stop all sites" command hangs and gets force-killed, so that session's WordPress servers keep running inside the process-manager daemon — which is machine-global (a fixed named pipe), shared across every session. Playground sites each consume 6 of the daemon's 36 capacity units, so after only six leaked sites the cap is exhausted and every subsequent site creation fails by timeout, stretching the suite far past the job limit. Leaked php.exe processes also hold DLL locks that break session cleanup and delay runner exit.
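The arithmetic behind "six leaked sites" can be checked with a small model. This is only a sketch: `CapacityTracker` and `leakedSitesUntilExhausted` are hypothetical names, not the daemon's real accounting code, and the only figures taken from this PR are the 36-unit cap and the 6-unit weight of a Playground site.

```typescript
// Hypothetical model of the daemon's capacity accounting (illustrative only).
const CAPACITY_LIMIT = 36; // total units, as stated in this PR
const PLAYGROUND_SITE_WEIGHT = 6; // units per Playground site, as stated in this PR

class CapacityTracker {
  private used = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Reserve units for a new site; returns false when the cap would be exceeded.
  tryReserve(weight: number): boolean {
    if (this.used + weight > this.limit) {
      return false;
    }
    this.used += weight;
    return true;
  }

  // Release units when a site's server actually stops.
  release(weight: number): void {
    this.used = Math.max(0, this.used - weight);
  }

  get unitsInUse(): number {
    return this.used;
  }
}

// Count how many sites can leak (reserve without release) before the
// next site creation is refused.
function leakedSitesUntilExhausted(limit: number, weight: number): number {
  const tracker = new CapacityTracker(limit);
  let leaked = 0;
  while (tracker.tryReserve(weight)) {
    leaked++; // never released, mimicking a session whose stop was force-killed
  }
  return leaked;
}
```

With the figures above, `leakedSitesUntilExhausted(CAPACITY_LIMIT, PLAYGROUND_SITE_WEIGHT)` returns 6: once six sites have leaked, a seventh reservation does not fit.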

This PR breaks the chain at several levels so it holds even if one product-side hang remains:

  • Daemon requests can no longer hang forever waiting on a wedged daemon that accepts a connection but never replies.
  • The daemon's shutdown always completes and frees capacity, even when a child process cannot be killed.
  • The E2E harness reaps any surviving daemon between sessions, so leaked sites can't accumulate across the suite, and cleanup no longer aborts when the app failed to launch.
  • The quit-time stop now logs the CLI's progress events, so any future hang shows exactly how far it got.
  • Windows E2E is re-enabled in CI to verify the fix.
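For reviewers who want the first bullet in concrete terms, here is a minimal sketch of bounding a request with a response timeout. It is not the actual `SocketRequestClient` change: `withResponseTimeout` and `ResponseTimeoutError` are hypothetical names, and the real client may structure this differently.

```typescript
// Hypothetical error type for a request that received no reply in time.
class ResponseTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`No response from daemon within ${timeoutMs} ms`);
    this.name = 'ResponseTimeoutError';
  }
}

// Wrap a pending request so it rejects if no response arrives in time.
// A daemon that accepts the connection but never replies then produces
// an error instead of hanging the caller forever.
function withResponseTimeout<T>(pending: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new ResponseTimeoutError(timeoutMs));
    }, timeoutMs);

    pending.then(
      (value) => {
        clearTimeout(timer); // do not keep the event loop alive after success
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```

The sketch only stops the caller from waiting; cleaning up the underlying socket after a timeout is left to the caller and is not shown here.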

Note: some Windows E2E tests may still fail for an unrelated reason — the log shows a separate Windows-only crash during site creation (Assigning port… step) that predates the capacity cascade. That needs its own follow-up; this PR's goal is to stop the hang so the suite finishes in a normal amount of time and that failure becomes visible.

Testing Instructions

  • CI: The E2E Tests on windows-x64 job should reach a terminal state well under the 180-minute timeout, with no CAPACITY_LIMIT_REACHED (36/36) errors in the log.
  • Unit: npm test -- apps/cli/tests/ (817 tests pass, including new daemon force-settle and socket response-timeout tests).

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors?

🤖 Generated with Claude Code

…tween sessions

Every session's quit-time 'site stop --all' hangs past its 20s timeout on
Windows CI, leaving that session's site servers running in the machine-global
process-manager daemon. Playground sites weigh 6 capacity units, so six leaked
sites exhaust the 36-unit cap; every later createSite then fails by timeout,
stretching the suite past the 180-minute job limit, and leaked php.exe
processes block session cleanup and runner exit.

- SocketRequestClient now times out waiting for a response, so a wedged daemon
  can no longer hang CLI commands forever
- The daemon's stopProcess settles even if a child never reports exit, so
  kill-daemon always completes and capacity is freed
- E2E cleanup reaps any surviving daemon tree between sessions and no longer
  aborts when the app failed to launch
- The quit-time stop logs CLI progress events for future diagnosis
- Re-enable Windows E2E in CI to verify (AINFRA-2588)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
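
The force-settle behavior described for `stopProcess` in the commit above can be sketched roughly as follows. This assumes a simplified `ManagedChild` interface and a `freeCapacity` callback, both hypothetical; the daemon's real types and bookkeeping are not shown in this PR description.

```typescript
// Hypothetical shape of a child managed by the daemon.
interface ManagedChild {
  kill(): void;
  onExit(listener: () => void): void;
}

type StopOutcome = 'exited' | 'force-settled';

// Stop a child and always settle: either the child reports exit, or the
// grace period elapses. Capacity is freed exactly once in both cases.
function stopWithForceSettle(
  child: ManagedChild,
  graceMs: number,
  freeCapacity: () => void
): Promise<StopOutcome> {
  return new Promise<StopOutcome>((resolve) => {
    let settled = false;

    const settle = (outcome: StopOutcome) => {
      if (settled) return; // ignore a late exit after a forced settle
      settled = true;
      clearTimeout(timer);
      freeCapacity();
      resolve(outcome);
    };

    const timer = setTimeout(() => settle('force-settled'), graceMs);
    child.onExit(() => settle('exited'));

    try {
      child.kill();
    } catch {
      // The child could not be killed; the timer above still settles the stop.
    }
  });
}
```

The trade-off in this sketch is that a force-settled child may still be running, so freed capacity can briefly under-count real processes; that is the price of guaranteeing that shutdown completes.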
gavande1 requested a review from a team as a code owner July 2, 2026 06:57
wpmobilebot (Collaborator) commented Jul 2, 2026

📊 Performance Test Results

Comparing 78deb0b vs trunk

app-size

| Metric | trunk | 78deb0b | Diff | Change |
| --- | --- | --- | --- | --- |
| App Size (Mac) | 1317.24 MB | 1317.24 MB | +0.00 MB | ⚪ 0.0% |

site-editor

| Metric | trunk | 78deb0b | Diff | Change |
| --- | --- | --- | --- | --- |
| load | 1075 ms | 1075 ms | 0 ms | ⚪ 0.0% |

site-startup

| Metric | trunk | 78deb0b | Diff | Change |
| --- | --- | --- | --- | --- |
| siteCreation | 6526 ms | 6504 ms | 22 ms | ⚪ 0.0% |
| siteStartup | 1859 ms | 1856 ms | 3 ms | ⚪ 0.0% |

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

On Windows the `site stop --all` CLI stops the sites and reports success within a few
hundred ms, but its process can linger without self-exiting — so stopAllServers waited out
the full quit timeout (20s in E2E) and force-killed it on every session, adding ~20s per
session to the suite. Act on the CLI's reported completion event and reap the process then,
instead of waiting for it to exit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
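
The idea in the commit above, acting on the CLI's completion event rather than waiting for the process to exit, might look roughly like this. It is a sketch under assumptions: the `CliProcess` interface and the `'stop-all-complete'` event name are hypothetical and are not taken from the CLI or from `stopAllServers`.

```typescript
// Hypothetical interface standing in for the spawned `site stop --all` process.
interface CliProcess {
  onEvent(listener: (event: { type: string }) => void): void;
  onExit(listener: () => void): void;
  kill(): void;
}

type StopAllResult = 'completed' | 'exited' | 'timed-out';

// Assumed event name, for illustration only.
const COMPLETION_EVENT = 'stop-all-complete';

// Resolve as soon as the CLI reports completion and reap the process then,
// instead of waiting out the full quit timeout for it to exit on its own.
function waitForStopAll(cli: CliProcess, quitTimeoutMs: number): Promise<StopAllResult> {
  return new Promise<StopAllResult>((resolve) => {
    let finished = false;

    const finish = (result: StopAllResult, reap: boolean) => {
      if (finished) return;
      finished = true;
      clearTimeout(timer);
      if (reap) cli.kill(); // the process may linger, so reap it ourselves
      resolve(result);
    };

    // Fallback: no completion event and no exit, so force-kill at the timeout.
    const timer = setTimeout(() => finish('timed-out', true), quitTimeoutMs);

    cli.onExit(() => finish('exited', false));
    cli.onEvent((event) => {
      if (event.type === COMPLETION_EVENT) finish('completed', true);
    });
  });
}
```

In this sketch the timeout path is kept as a safety net, so a CLI that never reports completion is still force-killed after `quitTimeoutMs`.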