fix(broker-lifecycle): reject crashed brokers via PID-alive probe#262
fix(broker-lifecycle): reject crashed brokers via PID-alive probe#262Dimaris-nsk wants to merge 1 commit intoopenai:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d5c5f4f40e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| export async function ensureBrokerSession(cwd, options = {}) { | ||
| const existing = loadBrokerSession(cwd); | ||
| if (existing && (await isBrokerEndpointReady(existing.endpoint))) { | ||
| if (existing && !isSessionStale(existing) && (await isBrokerEndpointReady(existing.endpoint))) { |
There was a problem hiding this comment.
Terminate stale broker before rotating to a new session
This new !isSessionStale(existing) gate sends age-expired (but still running) brokers through the existing teardown branch, but that branch does not terminate the broker unless killProcess is supplied, and ensureBrokerSession() normally runs without it. In that common path, only files/socket are deleted while the detached broker keeps running, so each age-based rotation can leak another orphan broker/app-server process and gradually consume resources.
Useful? React with 👍 / 👎.
| // PID check — covers crashed-broker case | ||
| if (session.pid != null && !isPidAlive(session.pid)) return true; | ||
| // Age check — covers slow-degradation case | ||
| if (session.startedAt) { |
There was a problem hiding this comment.
Force one-time rotation when legacy sessions lack startedAt
The age check only executes when session.startedAt exists, but startedAt is introduced in this commit, so pre-upgrade broker.json entries skip age validation entirely. If an old unhealthy broker still has a live PID and responds to the socket ping, this logic will keep reusing it indefinitely, which means the stale-session fix does not apply to the very existing sessions users already have until they manually clear state.
Useful? React with 👍 / 👎.
d5c5f4f to
6ce0540
Compare
|
Thanks @chatgpt-codex-connector — both findings applied in 1. Legacy sessions (P1 — missing startedAt): 2. Orphan broker leak on age rotation (P1 — killProcess default): Caller opt-out still works — any caller that explicitly passes No test harness exists for broker lifecycle in this repo that I found, so behavioral verification was a manual repro against the v1.0.2 plugin cache (spawn broker → age past threshold → next ensureBrokerSession → observe PID terminates via SIGTERM + no orphaned |
|
Note To use Codex here, create an environment for this repo. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ce0540dce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Use defaultKillProcess so stale (but PID-alive) brokers actually | ||
| // terminate during rotation — otherwise they leak as orphaned | ||
| // processes while only their socket/pidfile are cleaned up. | ||
| killProcess: options.killProcess ?? defaultKillProcess |
There was a problem hiding this comment.
Verify PID ownership before terminating stale sessions
ensureBrokerSession now always supplies defaultKillProcess when tearing down any existing session, which means a stale broker.json can cause SIGTERM to be sent to whatever process currently owns that PID. If the broker exited earlier and the OS recycled its PID, the next endpoint-rotation path will terminate an unrelated local process. This regression is introduced by the new default kill behavior and needs an identity check (or stronger PID source) before signaling.
Useful? React with 👍 / 👎.
|
Good catch — fixed in the same amended commit. Added
If any check fails (file removed, different PID, unreadable) we fall through to Invariant: the broker writes its pid-file before becoming discoverable, and on normal exit the file is removed by |
6ce0540 to
b882f90
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b882f901cf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| try { | ||
| if (!fs.existsSync(session.pidFile)) return false; | ||
| const content = fs.readFileSync(session.pidFile, "utf8").trim(); | ||
| return Number(content) === session.pid; |
There was a problem hiding this comment.
Strengthen PID ownership check before SIGTERM
verifyBrokerPid() treats Number(readFile(pidFile)) === session.pid as proof of ownership, but that check is insufficient once PIDs are recycled: if the broker crashed ungracefully, the stale pid file can remain and later match an unrelated process that reuses the same PID, and this commit now uses that result to enable defaultKillProcess during teardown. Fresh evidence: app-server-broker.mjs only deletes the pid file in shutdown(), so crash paths can leave stale pid files behind. In that state, a failed endpoint probe will trigger SIGTERM against a non-broker process.
Useful? React with 👍 / 👎.
b882f90 to
adb220e
Compare
|
Fair — pid-file alone is not enough because Strengthened in amended commit: now
Any failure → return This closes the stale-pid-file + PID-recycle hole. If maintainers prefer a different identity source (e.g. read |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: adb220edde
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // behavior — only socket/pidfile cleanup, no signal). | ||
| function verifyBrokerPid(session) { | ||
| if (!session || !Number.isFinite(session.pid) || !session.pidFile) return false; | ||
| if (process.platform === "win32") return false; |
There was a problem hiding this comment.
Terminate stale brokers on Windows rotations
verifyBrokerPid() unconditionally returns false on Windows, so ensureBrokerSession() never falls back to defaultKillProcess during stale-session teardown. In this commit, stale sessions are now rotated by age, so a healthy-but-old broker on Windows gets its files/session record deleted and a new broker spawned while the previous detached broker keeps running. Because each new session uses a different pipe path, this leak is silent and can accumulate over long-running usage.
Useful? React with 👍 / 👎.
adb220e to
9167f2b
Compare
|
Added Windows branch so age-based rotation doesn't silently leak brokers there. New Windows path uses
Then the existing This is weaker than the POSIX If a stronger Windows identity check is required, the repo would need a new per-broker token file (e.g. a random UUID broker writes on spawn that survives crashes unchanged). I left that out of this PR to keep scope tight; happy to follow up with a separate PR if maintainers want that level. Fallback on any |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9167f2b056
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const age = Date.now() - session.startedAt; | ||
| if (age > getBrokerMaxAgeMs()) return true; |
There was a problem hiding this comment.
Avoid age-only rotation while broker may be serving requests
isSessionStale() marks a session stale solely from elapsed age, and ensureBrokerSession() immediately tears that session down (including SIGTERM when PID verification passes). In the concurrent-client case (broker older than threshold, one client currently mid-turn, second client connecting), this path can kill the shared broker and drop the in-flight turn instead of returning the existing busy response path. Because app-server-broker.mjs exits on SIGTERM, this introduces user-visible task interruptions once the broker passes max age.
Useful? React with 👍 / 👎.
| if (!result || !result.trim()) return false; | ||
| return /(^|")node\.exe(?:"|,)/i.test(result); |
There was a problem hiding this comment.
Tighten Windows PID ownership check before SIGTERM
On Windows, verifyBrokerPid() currently accepts ownership when tasklist shows the PID is node.exe; fresh evidence in this revision is that the new branch returns true on that image-name match alone. If a stale pid file survives a broker crash and the OS recycles that PID to an unrelated Node process, teardown will still call defaultKillProcess and terminate the wrong process. This check needs stronger identity validation (for example full command line) or should avoid kill when ownership cannot be proven.
Useful? React with 👍 / 👎.
9167f2b to
fc6af22
Compare
|
Dropped both problematic parts of the previous revision. Age-based rotation removed. You were right — rotating a healthy-but-aged broker while a concurrent client is mid-turn interrupts that turn because Windows kill attempt removed. What remains in this PR (the narrow, defensible core):
Commit message rewritten to match the scope. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc6af220e4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| encoding: "utf8", | ||
| timeout: 1000 | ||
| }); | ||
| return cmd.includes("app-server-broker.mjs"); |
There was a problem hiding this comment.
Match broker command args before sending SIGTERM
Fresh evidence in this revision is that ownership is now accepted solely by cmd.includes("app-server-broker.mjs"), which is still too broad once PID reuse occurs. If a crashed session leaves a stale pid file and that PID is later recycled to a different broker instance (for another workspace), this check passes and ensureBrokerSession() will route teardown through defaultKillProcess, terminating an unrelated live broker. Because default teardown previously did not kill by default, this introduces a new cross-session kill regression; the verification should also match instance-specific arguments like --pid-file or --endpoint.
Useful? React with 👍 / 👎.
fc6af22 to
7e0123f
Compare
isBrokerEndpointReady() only pings the socket for 150ms. If the broker process has crashed but its socket file lingers (unix domain) or the listener drops without the probe noticing, the existing session is trusted and reused — but every downstream task disconnects mid-turn because the transport subsystem behind the socket is actually gone. Add isPidAlive() check consulted before trusting the socket ping. If the PID is dead, tear down and respawn. Safety around the teardown's SIGTERM: - verifyBrokerPid() cross-checks session.pid against the on-disk pid-file contents AND the live process command line via `ps` (POSIX) before returning true. - Windows intentionally returns false — tasklist exposes image name but not command line, and matching node.exe alone is too weak to rule out recycled-PID foreign processes. Windows rotation still cleans socket/pidfile; detached old broker eventually exits on its own since no new client reaches it. - If verifyBrokerPid() returns false (e.g. stale pid-file, PID gone, ps lookup fails), killProcess falls back to null — no signal, only file cleanup, same as trunk behavior. Age-based rotation for healthy-but-degrading brokers was considered and dropped in this revision: rotating a still-serving broker can interrupt a concurrent client's in-flight turn. A proper fix needs an active health probe (e.g. lightweight RPC round-trip) or graceful drain. Out of scope for this PR; filed as a follow-up.
7e0123f to
856fbd8
Compare
|
Good catch @chatgpt-codex-connector — valid P1, the script-name substring alone was too broad and would cross-kill a foreign broker after PID recycle. Tightened in amended commit ( if (!cmd.includes("app-server-broker.mjs")) return false;
if (!cmd.includes(`--pid-file ${session.pidFile}`)) return false;
if (session.endpoint && !cmd.includes(`--endpoint ${session.endpoint}`)) return false;
return true;Why this is instance-specific:
If maintainers prefer something stronger (e.g. a broker-written identity token file separate from |
|
Note To use Codex here, create an environment for this repo. |
Summary
ensureBrokerSession()trusts any existing broker whose endpoint passes a 150ms socket ping. When the broker's underlyingcodex app-serversubprocess is in a bad state but the socket accept-loop still works, every downstream task disconnects mid-turn — yet on each call the helper reuses the same broker instead of spawning a fresh one.Observed in the wild: a broker process that had been up 3 days 11 hours caused 100% task failure across many invocations. Deleting
broker.json(forcing a fresh broker on the next task) immediately restored reliability.Fix
Add
isSessionStale()with two complementary checks, consulted before trusting the socket ping:process.kill(pid, 0)) — catches the acute case where the broker has crashed but the socket file lingers.session.startedAt(new field) and forces a fresh broker past a threshold. Default 6h, overridable viaCODEX_COMPANION_BROKER_MAX_AGE_HOURSenv var. Catches the slow-degradation case neither PID nor socket ping surface.If
isSessionStale()returns true, the existing session flows through the same teardown path already used for missing/unreachable brokers — a fresh broker spawns.Test plan
node --checkon modified fileNotes on the default
6 hours is intuition-based, not data-driven — we didn't have a reliable timeline on when the codex app-server starts degrading. Env override is deliberate so operators can tune without a code change. Happy to switch to a more data-backed default if maintainers have telemetry.
Related PRs (independent)
fix(captureTurn): race state.completion with exitPromise to surface transport close— surfaces the failure mode when broker proxy dropsfix(plugin): upgrade --write sandbox to danger-full-access so MCP tool calls work— unrelated sandbox/MCP issue