Skip to content

Isolate the process-manager daemon per home on Windows#4061

Draft
gcsecsey wants to merge 9 commits into
fix-windows-e2e-daemon-leakfrom
fix-windows-daemon-pipe-isolation
Draft

Isolate the process-manager daemon per home on Windows#4061
gcsecsey wants to merge 9 commits into
fix-windows-e2e-daemon-leakfrom
fix-windows-daemon-pipe-isolation

Conversation

@gcsecsey

@gcsecsey gcsecsey commented Jul 2, 2026

Copy link
Copy Markdown
Member

Related issues

How AI was used in this PR

Claude cross-referenced the AINFRA-2588 infra investigation with a local reproduction on a Windows machine. It traced the "same-daemon-shared-across-runs" behavior to apps/cli/lib/paths.ts hardcoding the Windows named-pipe names, wrote the fix and a unit test, and verified before/after on Windows (two custom homes share one daemon on trunk vs. two isolated daemons with the fix; the default home still shares a single daemon).

Proposed Changes

The CLI's process-manager daemon is meant to be isolatable per run. Setting STUDIO_PROCESS_MANAGER_HOME gives you your own daemon and sites, which is what the CLI integration-test harness relies on, so tests don't touch each other or a developer's real daemon.

On Windows, that isolation silently did nothing. Unix domain sockets live under the home directory, but Windows named pipes share one flat, machine-global namespace, and the pipe names were hardcoded. So every process on the machine is connected to the same daemon regardless of STUDIO_PROCESS_MANAGER_HOME. As a result, CLI tests (and anything else) on Windows all shared one machine-global daemon: leaked/hung sites from one run bled into the next, and a wedged daemon could poison later runs and other CI jobs on the same agent.

This PR derives the Windows pipe name from the home directory when a custom home is set, so isolation works on Windows the same way it already does on macOS/Linux. The default home keeps its original fixed pipe name, so the shipping desktop app and CLI still share a single daemon exactly as before.

Testing Instructions

  • Unit tests: npm test -- apps/cli/tests/paths.test.ts
  • Manual (in PowerShell): run two CLI commands with two different STUDIO_PROCESS_MANAGER_HOME values and confirm two separate process-manager-daemon processes now exist. Two invocations with the default home still yield a single shared daemon.
# --- setup ---
npm run cli:build

$cli = "apps\cli\dist\cli\main.mjs"
function Kill-StudioDaemons  { Get-CimInstance Win32_Process | ? { $_.CommandLine -match 'process-manager-daemon\.mjs' } | % { taskkill /F /T /PID $_.ProcessId 2>$null | Out-Null } }
function Count-StudioDaemons { @(Get-CimInstance Win32_Process | ? { $_.CommandLine -match 'process-manager-daemon\.mjs' }).Count }

# --- TEST 1: two custom homes -> two isolated daemons (the fix) ---
Kill-StudioDaemons
$env:STUDIO_PROCESS_MANAGER_HOME = "$env:TEMP\studio-home-A"; node $cli site list | Out-Null
$env:STUDIO_PROCESS_MANAGER_HOME = "$env:TEMP\studio-home-B"; node $cli site list | Out-Null
Start-Sleep -Milliseconds 500
"Isolated daemons: $(Count-StudioDaemons)   (EXPECT 2)"
"home-A logs: $(Test-Path "$env:TEMP\studio-home-A\logs")   home-B logs: $(Test-Path "$env:TEMP\studio-home-B\logs")   (both EXPECT True)"

# --- TEST 2: default home -> one shared daemon (no regression) ---
Remove-Item Env:\STUDIO_PROCESS_MANAGER_HOME
Kill-StudioDaemons
node $cli site list | Out-Null; node $cli site list | Out-Null
Start-Sleep -Milliseconds 500
"Shared daemons (default home): $(Count-StudioDaemons)   (EXPECT 1)"
  • CI (stacked): the E2E Tests on windows-x64 job should reach a terminal state well under the 180-minute timeout, with no CAPACITY_LIMIT_REACHED (36/36) cascade in the log.

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors?

@wpmobilebot

Copy link
Copy Markdown
Collaborator

📊 Performance Test Results

Comparing 7347fd7 vs trunk

app-size

Metric trunk 7347fd7 Diff Change
App Size (Mac) 1316.92 MB 1316.83 MB 0.09 MB ⚪ 0.0%

site-editor

Metric trunk 7347fd7 Diff Change
load 1066 ms 1056 ms 10 ms ⚪ 0.0%

site-startup

Metric trunk 7347fd7 Diff Change
siteCreation 6531 ms 6488 ms 43 ms ⚪ 0.0%
siteStartup 1855 ms 1864 ms +9 ms ⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

@gcsecsey gcsecsey requested review from a team July 2, 2026 14:07
@gcsecsey gcsecsey requested a review from a team as a code owner July 2, 2026 14:08
@gcsecsey gcsecsey changed the base branch from trunk to fix-windows-e2e-daemon-leak July 2, 2026 14:08
@gcsecsey gcsecsey requested a review from a team as a code owner July 2, 2026 14:08
@gcsecsey gcsecsey force-pushed the fix-windows-daemon-pipe-isolation branch from b77bcb5 to d67ccb2 Compare July 2, 2026 14:10
On Windows the daemon's control/events sockets were hardcoded named pipes, so every
process shared one machine-global daemon regardless of STUDIO_PROCESS_MANAGER_HOME —
silently defeating the per-run isolation that already works on macOS/Linux and that the
CLI test harness relies on. Derive the pipe name from the home when a custom one is set;
the default home keeps its original fixed name so the desktop app and CLI still share a
single daemon exactly as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
@gcsecsey gcsecsey force-pushed the fix-windows-daemon-pipe-isolation branch from d67ccb2 to 85b86f7 Compare July 2, 2026 15:53

@bcotrim bcotrim left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on Windows — pipe isolation works: custom STUDIO_PROCESS_MANAGER_HOME gets its own daemon on a hashed pipe, default home keeps the fixed names.

On suggested fix:
killLeakedDaemonProcesses() taskkills its own cmd.exe wrapper (its command line also contains process-manager-daemon), so the sweep dies mid-loop and can leave daemons alive.

Replicate (Windows):

  1. In e2e-helpers.ts, log the swallowed error: catch ( error ) { console.log( '[daemon reap]', error ) }
  2. npm run e2e -- app.test.ts
  3. Cleanup prints [daemon reap] Error: Command failed: powershell … — suite still reports green.

Fix:

-  "Where-Object { $_.CommandLine -match 'process-manager-daemon' -and $_.ProcessId -ne $PID } | " +
+  "Where-Object { $_.CommandLine -match 'process[-]manager[-]daemon' } | " +

[-] still matches the daemon's command line, but not the sweep's own (which contains the brackets). After the fix, step 2 shows no error and no daemon survives.

gcsecsey and others added 3 commits July 3, 2026 08:46
The build skipped the PHP download whenever apps/studio/php-bin/<packageId>/ already
existed, so a package republished in place (same packageVersion, new checksum — e.g.
adding the Windows VC++ runtime DLLs) was never picked up on reused CI agents, which kept
bundling a stale php.exe. Record the archive SHA on install and re-download when it no
longer matches the expected SHA.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
Shorten stale-PHP log line to satisfy prettier

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
@
@gcsecsey

gcsecsey commented Jul 3, 2026

Copy link
Copy Markdown
Member Author

Thanks for the review @bcotrim! I'll keep iterating on this to try to pinpoint the root cause on CI, for the time being I'm converting it back to a draft.

@gcsecsey gcsecsey marked this pull request as draft July 3, 2026 11:31
Temporary instrumentation to find why native PHP doesn't serve on Windows CI.
Routes the E2E process-manager daemon and its per-process logs under
test-results/daemon-logs so they upload as artifacts, and logs any unexpected
php.exe exit code (a Windows loader/DLL failure exits 3221225781 with no stderr).
Revert once diagnosed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
gcsecsey and others added 4 commits July 3, 2026 13:09
The previous commit broke Lint (a too-long console.error) and the linux/mac E2E
(STUDIO_PROCESS_MANAGER_HOME pointed the Unix daemon socket at a path past sun_path's
~108-char limit). Gate the daemon-home override to win32 (the platform we're diagnosing,
where the pipe name is hashed to a short name), shorten the path, and wrap the log line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
The post-suite Playwright runner hang makes the windows E2E job time out and skip
the artifact upload, so the daemon-logs artifact never arrives. Instead, on test
failure, dump the detached daemon's per-site -error.log files (which hold php.exe's
[PHP-DIAG] exit code and stderr) straight into the Playwright job log, so the real
native-PHP failure is readable from the streamed CI log mid-run. Also collapses the
win32 daemon-home ternary onto one line to satisfy prettier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
os.tmpdir() can resolve to a Windows 8.3 short name (e.g. C:\Users\BUILDK~1\...) when
the account name exceeds 8 chars, as on CI's buildkite-agent. That path was handed to
PHP as auto_prepend_file (and the phpMyAdmin config), and PHP's parser chokes on the
tilde with "syntax error, unexpected '~'", so every site failed to serve and its tests
timed out on ERR_CONNECTION_REFUSED. Local dev with a short username produced no tilde,
which is why this only reproduced on CI.

Add getPhpSafeTmpDir() (long-form tmpdir on Windows) and use it for every temp path
handed to PHP, and quote the auto_prepend_file directive to match the others.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
…oad workaround

Reverts the temporary daemon-log streaming, [PHP-DIAG] exit logging, and per-session
daemon-home routing used to surface the native-PHP failure, plus the SHA-marker
re-download in download-php-binary.ts, which targeted an incorrect "stale bundle /
missing VC++ DLL" theory. The real cause is fixed in the preceding commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants