Isolate the process-manager daemon per home on Windows#4061
Conversation
📊 Performance Test ResultsComparing 7347fd7 vs trunk app-size
site-editor
site-startup
Results are median values from multiple test runs. Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff) |
b77bcb5 to
d67ccb2
Compare
On Windows the daemon's control/events sockets were hardcoded named pipes, so every process shared one machine-global daemon regardless of STUDIO_PROCESS_MANAGER_HOME — silently defeating the per-run isolation that already works on macOS/Linux and that the CLI test harness relies on. Derive the pipe name from the home when a custom one is set; the default home keeps its original fixed name so the desktop app and CLI still share a single daemon exactly as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
d67ccb2 to
85b86f7
Compare
bcotrim
left a comment
There was a problem hiding this comment.
Tested on Windows — pipe isolation works: custom STUDIO_PROCESS_MANAGER_HOME gets its own daemon on a hashed pipe, default home keeps the fixed names.
On suggested fix:
killLeakedDaemonProcesses() taskkills its own cmd.exe wrapper (its command line also contains process-manager-daemon), so the sweep dies mid-loop and can leave daemons alive.
Replicate (Windows):
- In
e2e-helpers.ts, log the swallowed error:catch ( error ) { console.log( '[daemon reap]', error ) } npm run e2e -- app.test.ts- Cleanup prints
[daemon reap] Error: Command failed: powershell …— suite still reports green.
Fix:
- "Where-Object { $_.CommandLine -match 'process-manager-daemon' -and $_.ProcessId -ne $PID } | " +
+ "Where-Object { $_.CommandLine -match 'process[-]manager[-]daemon' } | " +[-] still matches the daemon's command line, but not the sweep's own (which contains the brackets). After the fix, step 2 shows no error and no daemon survives.
The build skipped the PHP download whenever apps/studio/php-bin/<packageId>/ already existed, so a package republished in place (same packageVersion, new checksum — e.g. adding the Windows VC++ runtime DLLs) was never picked up on reused CI agents, which kept bundling a stale php.exe. Record the archive SHA on install and re-download when it no longer matches the expected SHA. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
Shorten stale-PHP log line to satisfy prettier Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS @
|
Thanks for the review @bcotrim! I'll keep iterating on this to try to pinpoint the root cause on CI, for the time being I'm converting it back to a draft. |
Temporary instrumentation to find why native PHP doesn't serve on Windows CI. Routes the E2E process-manager daemon and its per-process logs under test-results/daemon-logs so they upload as artifacts, and logs any unexpected php.exe exit code (a Windows loader/DLL failure exits 3221225781 with no stderr). Revert once diagnosed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
The previous commit broke Lint (a too-long console.error) and the linux/mac E2E (STUDIO_PROCESS_MANAGER_HOME pointed the Unix daemon socket at a path past sun_path's ~108-char limit). Gate the daemon-home override to win32 (the platform we're diagnosing, where the pipe name is hashed to a short name), shorten the path, and wrap the log line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
The post-suite Playwright runner hang makes the windows E2E job time out and skip the artifact upload, so the daemon-logs artifact never arrives. Instead, on test failure, dump the detached daemon's per-site -error.log files (which hold php.exe's [PHP-DIAG] exit code and stderr) straight into the Playwright job log, so the real native-PHP failure is readable from the streamed CI log mid-run. Also collapses the win32 daemon-home ternary onto one line to satisfy prettier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
os.tmpdir() can resolve to a Windows 8.3 short name (e.g. C:\Users\BUILDK~1\...) when the account name exceeds 8 chars, as on CI's buildkite-agent. That path was handed to PHP as auto_prepend_file (and the phpMyAdmin config), and PHP's parser chokes on the tilde with "syntax error, unexpected '~'", so every site failed to serve and its tests timed out on ERR_CONNECTION_REFUSED. Local dev with a short username produced no tilde, which is why this only reproduced on CI. Add getPhpSafeTmpDir() (long-form tmpdir on Windows) and use it for every temp path handed to PHP, and quote the auto_prepend_file directive to match the others. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
…oad workaround Reverts the temporary daemon-log streaming, [PHP-DIAG] exit logging, and per-session daemon-home routing used to surface the native-PHP failure, plus the SHA-marker re-download in download-php-binary.ts, which targeted an incorrect "stale bundle / missing VC++ DLL" theory. The real cause is fixed in the preceding commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vfk7xXMfsABh5JMYX51wiS
Related issues
How AI was used in this PR
Claude cross-referenced the AINFRA-2588 infra investigation with a local reproduction on a Windows machine. It traced the "same-daemon-shared-across-runs" behavior to
apps/cli/lib/paths.tshardcoding the Windows named-pipe names, wrote the fix and a unit test, and verified before/after on Windows (two custom homes share one daemon on trunk vs. two isolated daemons with the fix; the default home still shares a single daemon).Proposed Changes
The CLI's process-manager daemon is meant to be isolatable per run. Setting
STUDIO_PROCESS_MANAGER_HOMEgives you your own daemon and sites, which is what the CLI integration-test harness relies on, so tests don't touch each other or a developer's real daemon.On Windows, that isolation silently did nothing. Unix domain sockets live under the home directory, but Windows named pipes share one flat, machine-global namespace, and the pipe names were hardcoded. So every process on the machine is connected to the same daemon regardless of
STUDIO_PROCESS_MANAGER_HOME. As a result, CLI tests (and anything else) on Windows all shared one machine-global daemon: leaked/hung sites from one run bled into the next, and a wedged daemon could poison later runs and other CI jobs on the same agent.This PR derives the Windows pipe name from the home directory when a custom home is set, so isolation works on Windows the same way it already does on macOS/Linux. The default home keeps its original fixed pipe name, so the shipping desktop app and CLI still share a single daemon exactly as before.
Testing Instructions
npm test -- apps/cli/tests/paths.test.tsSTUDIO_PROCESS_MANAGER_HOMEvalues and confirm two separateprocess-manager-daemonprocesses now exist. Two invocations with the default home still yield a single shared daemon.E2E Tests on windows-x64job should reach a terminal state well under the 180-minute timeout, with noCAPACITY_LIMIT_REACHED (36/36)cascade in the log.Pre-merge Checklist