fix(broker-lifecycle): reject crashed brokers via PID-alive probe#262

Open
Dimaris-nsk wants to merge 1 commit into openai:main from Dimaris-nsk:fix/broker-staleness-rotation

@Dimaris-nsk

Summary

ensureBrokerSession() trusts any existing broker whose endpoint passes a 150ms socket ping. When the broker's underlying codex app-server subprocess is in a bad state but the socket accept-loop still works, every downstream task disconnects mid-turn — yet on each call the helper reuses the same broker instead of spawning a fresh one.

Observed in the wild: a broker process that had been up 3 days 11 hours caused 100% task failure across many invocations. Deleting broker.json (forcing a fresh broker on the next task) immediately restored reliability.

Fix

Add isSessionStale() with two complementary checks, consulted before trusting the socket ping:

  1. PID-alive probe (process.kill(pid, 0)) — catches the acute case where the broker has crashed but the socket file lingers.
  2. Age-based rotation — tracks session.startedAt (new field) and forces a fresh broker past a threshold. Default 6h, overridable via CODEX_COMPANION_BROKER_MAX_AGE_HOURS env var. Catches the slow-degradation case neither PID nor socket ping surface.

If isSessionStale() returns true, the existing session flows through the same teardown path already used for missing/unreachable brokers — a fresh broker spawns.
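As a rough illustration of the two checks described above, here is a minimal sketch. The function names `isPidAlive` and `isSessionStale` come from this PR; the bodies, the options object, and the injected `pidAlive` hook are assumptions made so the logic can be exercised in isolation, not the actual diff.

```javascript
// Sketch only: names mirror the PR description, bodies are assumed.
const DEFAULT_MAX_AGE_MS = 6 * 60 * 60 * 1000; // 6h default mentioned in the PR text

function isPidAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0 only checks existence/permission, nothing is delivered
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it, so treat it as alive.
    return err.code === "EPERM";
  }
}

function isSessionStale(
  session,
  { now = Date.now(), maxAgeMs = DEFAULT_MAX_AGE_MS, pidAlive = isPidAlive } = {}
) {
  // PID check: covers the crashed-broker case.
  if (session.pid != null && !pidAlive(session.pid)) return true;
  // Age check: covers the slow-degradation case.
  if (session.startedAt && now - session.startedAt > maxAgeMs) return true;
  return false;
}
```

With this shape, a session whose PID is alive and whose age is under the threshold is still reused, which is the regression case listed in the test plan.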

Test plan

  • node --check on modified file
  • Verified behavioral fix: after deleting and respawning the broker via this logic, previously failing tasks complete cleanly (3 consecutive green tasks on the same workspace).
  • Regression: a fresh broker session (age 0, PID alive) is still reused. A straight read of the new code confirms this; it will also show up in CI if upstream has broker fixtures.

Notes on the default

6 hours is intuition-based, not data-driven — we didn't have a reliable timeline on when the codex app-server starts degrading. Env override is deliberate so operators can tune without a code change. Happy to switch to a more data-backed default if maintainers have telemetry.
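For reference, the env override could be parsed roughly like this. `getBrokerMaxAgeMs` and `CODEX_COMPANION_BROKER_MAX_AGE_HOURS` are the names used in this PR; the validation rules (fall back to the default on missing, non-numeric, zero, or negative values) are an assumption, not necessarily what the commit does.

```javascript
// Sketch only: the fallback rules are assumed, not taken from the diff.
const DEFAULT_MAX_AGE_HOURS = 6;

function getBrokerMaxAgeMs(env = process.env) {
  const hours = Number(env.CODEX_COMPANION_BROKER_MAX_AGE_HOURS);
  // Number(undefined) is NaN and Number("") is 0, so both fall back to the default.
  const valid = Number.isFinite(hours) && hours > 0;
  return (valid ? hours : DEFAULT_MAX_AGE_HOURS) * 60 * 60 * 1000;
}
```

Falling back on bad input, rather than throwing, keeps a typo in the environment from blocking broker startup.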

Related PRs (independent)

  • fix(captureTurn): race state.completion with exitPromise to surface transport close — surfaces the failure mode when broker proxy drops
  • fix(plugin): upgrade --write sandbox to danger-full-access so MCP tool calls work — unrelated sandbox/MCP issue

@Dimaris-nsk requested a review from a team on April 23, 2026 06:53

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5c5f4f40e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

  export async function ensureBrokerSession(cwd, options = {}) {
    const existing = loadBrokerSession(cwd);
-   if (existing && (await isBrokerEndpointReady(existing.endpoint))) {
+   if (existing && !isSessionStale(existing) && (await isBrokerEndpointReady(existing.endpoint))) {

P1 Badge Terminate stale broker before rotating to a new session

This new !isSessionStale(existing) gate sends age-expired (but still running) brokers through the existing teardown branch, but that branch does not terminate the broker unless killProcess is supplied, and ensureBrokerSession() normally runs without it. In that common path, only files/socket are deleted while the detached broker keeps running, so each age-based rotation can leak another orphan broker/app-server process and gradually consume resources.

Useful? React with 👍 / 👎.

// PID check — covers crashed-broker case
if (session.pid != null && !isPidAlive(session.pid)) return true;
// Age check — covers slow-degradation case
if (session.startedAt) {

P1 Badge Force one-time rotation when legacy sessions lack startedAt

The age check only executes when session.startedAt exists, but startedAt is introduced in this commit, so pre-upgrade broker.json entries skip age validation entirely. If an old unhealthy broker still has a live PID and responds to the socket ping, this logic will keep reusing it indefinitely, which means the stale-session fix does not apply to the very existing sessions users already have until they manually clear state.

@Dimaris-nsk force-pushed the fix/broker-staleness-rotation branch from d5c5f4f to 6ce0540 on April 23, 2026 06:59
@Dimaris-nsk
Author

Thanks @chatgpt-codex-connector — both findings applied in 6ce0540:

1. Legacy sessions (P1 — missing startedAt):
Added explicit branch — if startedAt is absent, isSessionStale() returns true so pre-upgrade broker.json entries go through one-time rotation on first upgraded run. After that rotation the freshly-spawned broker has startedAt set, and the normal age path takes over.

2. Orphan broker leak on age rotation (P1 — killProcess default):
Added defaultKillProcess() (SIGTERM, best-effort, ignores missing-pid errors) and wired it into the rotation path: killProcess: options.killProcess ?? defaultKillProcess. Now age-based rotation actually terminates the previous detached broker instead of only removing its socket/pidfile.

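Caller opt-out still works for any caller that passes a custom killProcess function. An explicit killProcess: null does not opt out, though: ?? treats null the same as undefined, so that case also falls back to defaultKillProcess.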
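One detail worth checking in the `options.killProcess ?? defaultKillProcess` wiring: the nullish coalescing operator falls back for both `undefined` and `null`. The sketch below shows that behavior. `defaultKillProcess` follows the description in this comment (SIGTERM, best-effort); `resolveKillProcess` is a hypothetical helper introduced only to isolate the expression.

```javascript
// Sketch only: resolveKillProcess is a hypothetical wrapper around the PR's expression.
function defaultKillProcess(pid) {
  try {
    process.kill(pid, "SIGTERM");
  } catch {
    // Best effort: a missing PID (ESRCH) or other error is ignored.
  }
}

function resolveKillProcess(options = {}) {
  // `??` replaces null AND undefined, but keeps other falsy values such as false.
  return options.killProcess ?? defaultKillProcess;
}

const custom = () => {};
console.log(resolveKillProcess({}) === defaultKillProcess);                    // true
console.log(resolveKillProcess({ killProcess: null }) === defaultKillProcess); // true
console.log(resolveKillProcess({ killProcess: custom }) === custom);           // true
```

If an explicit opt-out is wanted, one option would be to check `"killProcess" in options` instead of relying on `??`; that is a design suggestion, not something this PR does.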

No test harness exists for broker lifecycle in this repo that I found, so behavioral verification was a manual repro against the v1.0.2 plugin cache (spawn broker → age past threshold → next ensureBrokerSession → observe PID terminates via SIGTERM + no orphaned codex app-server child in ps).

@chatgpt-codex-connector

Note

To use Codex here, create an environment for this repo.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ce0540dce

// Use defaultKillProcess so stale (but PID-alive) brokers actually
// terminate during rotation — otherwise they leak as orphaned
// processes while only their socket/pidfile are cleaned up.
killProcess: options.killProcess ?? defaultKillProcess

P1 Badge Verify PID ownership before terminating stale sessions

ensureBrokerSession now always supplies defaultKillProcess when tearing down any existing session, which means a stale broker.json can cause SIGTERM to be sent to whatever process currently owns that PID. If the broker exited earlier and the OS recycled its PID, the next endpoint-rotation path will terminate an unrelated local process. This regression is introduced by the new default kill behavior and needs an identity check (or stronger PID source) before signaling.

@Dimaris-nsk
Author

Good catch — fixed in the same amended commit.

Added verifyBrokerPid() which cross-checks session.pid against the on-disk pid-file the broker wrote at spawn time (app-server-broker.mjs writePidFile). Rotation only passes defaultKillProcess to teardownBrokerSession when:

  • session.pidFile path is set (we know where to look)
  • The file still exists on disk
  • The file's contents parse to the same PID as session.pid

If any check fails (file removed, different PID, unreadable) we fall through to killProcess: null — same as trunk before this PR — so a potentially-recycled PID is never signaled. Socket/pidfile/sessionDir cleanup still runs.

Invariant: the broker writes its pid-file before becoming discoverable, and on normal exit the file is removed by teardownBrokerSession. An abrupt exit leaves a stale pidfile, but then the isPidAlive gate in isSessionStale catches it first (returns stale → different branch that doesn't signal).

@Dimaris-nsk force-pushed the fix/broker-staleness-rotation branch from 6ce0540 to b882f90 on April 23, 2026 07:07

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b882f901cf

try {
  if (!fs.existsSync(session.pidFile)) return false;
  const content = fs.readFileSync(session.pidFile, "utf8").trim();
  return Number(content) === session.pid;

P1 Badge Strengthen PID ownership check before SIGTERM

verifyBrokerPid() treats Number(readFile(pidFile)) === session.pid as proof of ownership, but that check is insufficient once PIDs are recycled: if the broker crashed ungracefully, the stale pid file can remain and later match an unrelated process that reuses the same PID, and this commit now uses that result to enable defaultKillProcess during teardown. Fresh evidence: app-server-broker.mjs only deletes the pid file in shutdown(), so crash paths can leave stale pid files behind. In that state, a failed endpoint probe will trigger SIGTERM against a non-broker process.

@Dimaris-nsk force-pushed the fix/broker-staleness-rotation branch from b882f90 to adb220e on April 23, 2026 07:18
@Dimaris-nsk
Author

Fair — pid-file alone is not enough because app-server-broker.mjs only removes it in the clean shutdown() path. An ungraceful crash leaves a stale pid-file that a recycled PID can match.

Strengthened in amended commit: now verifyBrokerPid() also inspects ps -p <pid> -o command= and requires the command line to contain app-server-broker.mjs before considering SIGTERM safe. Conditions for passing:

  1. session.pid is a finite number
  2. session.pidFile set
  3. Platform is POSIX (Windows skipped — ps unavailable → returns false → fallthrough to null killProcess, matches trunk)
  4. pid-file exists and contents parse to session.pid
  5. ps lookup succeeds within 1s
  6. ps output contains app-server-broker.mjs

Any failure → return false → rotation uses killProcess: null → no signal, only socket/pidfile cleanup. Safe default unchanged.
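The six conditions above can be sketched as one fail-closed function. `verifyBrokerPid` is the name from this PR; the `deps` parameter (`platform`, `readPidFile`, `readCommandLine`) is an assumption introduced here so the chain can be run without a real broker. In real code those hooks would presumably wrap `process.platform`, `fs.readFileSync`, and a `ps -p <pid> -o command=` call with a 1s timeout.

```javascript
// Sketch only: dependency injection via `deps` is assumed, not part of the PR.
function verifyBrokerPid(session, deps) {
  const { platform, readPidFile, readCommandLine } = deps;
  try {
    // 1 and 2: need a numeric PID and a pid-file path to compare against.
    if (!session || !Number.isFinite(session.pid) || !session.pidFile) return false;
    // 3: POSIX only; Windows falls through to "do not signal".
    if (platform === "win32") return false;
    // 4: pid-file contents must parse to the recorded PID.
    const recorded = Number(String(readPidFile(session.pidFile)).trim());
    if (recorded !== session.pid) return false;
    // 5: the command-line lookup must succeed (a throw lands in the catch below).
    const cmd = readCommandLine(session.pid);
    // 6: the live process must look like the broker script.
    return typeof cmd === "string" && cmd.includes("app-server-broker.mjs");
  } catch {
    // Any failure means ownership is unproven, so the caller must not send SIGTERM.
    return false;
  }
}
```

Wrapping the whole chain in one `try` keeps the default on the safe side: an unreadable file or a failed lookup yields `false` rather than an exception in the rotation path.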

This closes the stale-pid-file + PID-recycle hole. If maintainers prefer a different identity source (e.g. read /proc/<pid>/comm on Linux, or require an explicit broker-side heartbeat file), happy to switch — let me know.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: adb220edde

// behavior — only socket/pidfile cleanup, no signal).
function verifyBrokerPid(session) {
  if (!session || !Number.isFinite(session.pid) || !session.pidFile) return false;
  if (process.platform === "win32") return false;

P2 Badge Terminate stale brokers on Windows rotations

verifyBrokerPid() unconditionally returns false on Windows, so ensureBrokerSession() never falls back to defaultKillProcess during stale-session teardown. In this commit, stale sessions are now rotated by age, so a healthy-but-old broker on Windows gets its files/session record deleted and a new broker spawned while the previous detached broker keeps running. Because each new session uses a different pipe path, this leak is silent and can accumulate over long-running usage.

@Dimaris-nsk force-pushed the fix/broker-staleness-rotation branch from adb220e to 9167f2b on April 23, 2026 07:23
@Dimaris-nsk
Author

Added Windows branch so age-based rotation doesn't silently leak brokers there.

New Windows path uses tasklist /FI "PID eq <pid>" /FO CSV /NH:

  1. pid-file exists and contents match session.pid
  2. tasklist returns a row (PID is live)
  3. The row's image name is node.exe

Then the existing defaultKillProcess (process.kill(pid, "SIGTERM")) on Windows dispatches to TerminateProcess via libuv — same mechanism the plugin already uses elsewhere for subprocess cleanup.

This is weaker than the POSIX ps-based check (tasklist doesn't expose full command line, so we can't verify the app-server-broker.mjs arg string — just that PID is alive AND is node.exe). But it closes the common case the bot raised (recycled PID ends up on some foreign process) for the vast majority of rotations: node.exe + matching pid-file content is reasonably specific.

If a stronger Windows identity check is required, the repo would need a new per-broker token file (e.g. a random UUID broker writes on spawn that survives crashes unchanged). I left that out of this PR to keep scope tight; happy to follow up with a separate PR if maintainers want that level.

Fallback on any tasklist failure → return false → killProcess: null → trunk behavior (no signal), so Windows users can't be worse off than before.
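To make the limits of the tasklist check concrete, here is a small sketch of parsing one `/FO CSV /NH` row. The helper names `parseTasklistRow` and `isNodeProcessRow` are hypothetical, and the sample row format (image name, PID, session name, session number, memory usage) is assumed from typical `tasklist` CSV output rather than taken from the diff.

```javascript
// Sketch only: helper names and the sample row layout are assumptions.
function parseTasklistRow(output) {
  const line = String(output).trim().split(/\r?\n/)[0];
  // Pull out the quoted CSV fields; the memory column may itself contain commas.
  const fields = [...line.matchAll(/"([^"]*)"/g)].map((m) => m[1]);
  if (fields.length < 2) return null; // e.g. an "INFO: No tasks..." message
  const pid = Number(fields[1]);
  return Number.isInteger(pid) ? { imageName: fields[0], pid } : null;
}

function isNodeProcessRow(output, expectedPid) {
  const row = parseTasklistRow(output);
  return row !== null && row.pid === expectedPid && row.imageName.toLowerCase() === "node.exe";
}
```

Note what this proves and what it does not: a `true` result only says that some live `node.exe` currently owns the PID, which is the weakness acknowledged above.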

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9167f2b056

Comment on lines +146 to +147
const age = Date.now() - session.startedAt;
if (age > getBrokerMaxAgeMs()) return true;

P1 Badge Avoid age-only rotation while broker may be serving requests

isSessionStale() marks a session stale solely from elapsed age, and ensureBrokerSession() immediately tears that session down (including SIGTERM when PID verification passes). In the concurrent-client case (broker older than threshold, one client currently mid-turn, second client connecting), this path can kill the shared broker and drop the in-flight turn instead of returning the existing busy response path. Because app-server-broker.mjs exits on SIGTERM, this introduces user-visible task interruptions once the broker passes max age.

Comment on lines +189 to +190
if (!result || !result.trim()) return false;
return /(^|")node\.exe(?:"|,)/i.test(result);

P1 Badge Tighten Windows PID ownership check before SIGTERM

On Windows, verifyBrokerPid() currently accepts ownership when tasklist shows the PID is node.exe; fresh evidence in this revision is that the new branch returns true on that image-name match alone. If a stale pid file survives a broker crash and the OS recycles that PID to an unrelated Node process, teardown will still call defaultKillProcess and terminate the wrong process. This check needs stronger identity validation (for example full command line) or should avoid kill when ownership cannot be proven.

@Dimaris-nsk force-pushed the fix/broker-staleness-rotation branch from 9167f2b to fc6af22 on April 23, 2026 07:37
@Dimaris-nsk changed the title from "fix(broker-lifecycle): reject stale sessions via PID-alive + age checks" to "fix(broker-lifecycle): reject crashed brokers via PID-alive probe" on Apr 23, 2026
@Dimaris-nsk
Author

Dropped both problematic parts of the previous revision.

Age-based rotation removed. You were right — rotating a healthy-but-aged broker while a concurrent client is mid-turn interrupts that turn because app-server-broker.mjs exits on SIGTERM. Proper fix would need a live health probe (RPC round-trip with timeout) or a graceful drain protocol, neither of which belongs in this narrow PR. Filed as a follow-up for a separate change.

Windows kill attempt removed. tasklist's image-name match alone (node.exe) is too weak to rule out recycled-PID foreign processes — you're right that without full command line we can't prove ownership. On Windows verifyBrokerPid() now returns false early; the rotation path there cleans socket/pidfile only, no SIGTERM. The detached old broker eventually exits since new clients will reach the new broker via its fresh pipe path.

What remains in this PR (the narrow, defensible core):

  1. isPidAlive() — catches the crashed-broker case where the socket file lingers but the process is gone. Without this, isBrokerEndpointReady happily reuses a session whose underlying process is no longer servicing it.
  2. verifyBrokerPid() — pid-file content match + POSIX ps command-line check before any SIGTERM. If any check fails → killProcess: null → socket/pidfile cleanup only, no signal, trunk behavior.
  3. defaultKillProcess() — SIGTERM with swallowed errors, only wired into rotation when verifyBrokerPid() confirms ownership.
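How the three pieces fit together can be summarized as a small decision table. `planBrokerRotation` and its boolean inputs are hypothetical names used only for this sketch; the real `ensureBrokerSession()` computes these conditions inline and awaits the endpoint probe.

```javascript
// Sketch only: a pure decision function standing in for the inline logic.
function planBrokerRotation({ hasSession, pidAlive, endpointReady, ownershipVerified }) {
  // No recorded session: nothing to tear down.
  if (!hasSession) return { action: "spawn", signalOldBroker: false };
  // Live PID and a responsive socket: keep using the existing broker.
  if (pidAlive && endpointReady) return { action: "reuse", signalOldBroker: false };
  // Otherwise rotate; SIGTERM is only allowed when ownership was proven.
  return { action: "teardown-and-spawn", signalOldBroker: Boolean(ownershipVerified) };
}
```

The key property is that `signalOldBroker` can only be `true` on the teardown path and only when the ownership check passed; every other combination cleans up files without signaling.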

Commit message rewritten to match the scope.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc6af220e4

encoding: "utf8",
timeout: 1000
});
return cmd.includes("app-server-broker.mjs");

P1 Badge Match broker command args before sending SIGTERM

Fresh evidence in this revision is that ownership is now accepted solely by cmd.includes("app-server-broker.mjs"), which is still too broad once PID reuse occurs. If a crashed session leaves a stale pid file and that PID is later recycled to a different broker instance (for another workspace), this check passes and ensureBrokerSession() will route teardown through defaultKillProcess, terminating an unrelated live broker. Because default teardown previously did not kill by default, this introduces a new cross-session kill regression; the verification should also match instance-specific arguments like --pid-file or --endpoint.

@Dimaris-nsk force-pushed the fix/broker-staleness-rotation branch from fc6af22 to 7e0123f on April 24, 2026 02:12
isBrokerEndpointReady() only pings the socket for 150ms. If the broker
process has crashed but its socket file lingers (unix domain) or the
listener drops without the probe noticing, the existing session is
trusted and reused — but every downstream task disconnects mid-turn
because the transport subsystem behind the socket is actually gone.

Add isPidAlive() check consulted before trusting the socket ping. If
the PID is dead, tear down and respawn.

Safety around the teardown's SIGTERM:
- verifyBrokerPid() cross-checks session.pid against the on-disk
  pid-file contents AND the live process command line via `ps`
  (POSIX) before returning true.
- Windows intentionally returns false — tasklist exposes image name
  but not command line, and matching node.exe alone is too weak to
  rule out recycled-PID foreign processes. Windows rotation still
  cleans socket/pidfile; detached old broker eventually exits on its
  own since no new client reaches it.
- If verifyBrokerPid() returns false (e.g. stale pid-file, PID gone,
  ps lookup fails), killProcess falls back to null — no signal, only
  file cleanup, same as trunk behavior.

Age-based rotation for healthy-but-degrading brokers was considered
and dropped in this revision: rotating a still-serving broker can
interrupt a concurrent client's in-flight turn. A proper fix needs
an active health probe (e.g. lightweight RPC round-trip) or graceful
drain. Out of scope for this PR; filed as a follow-up.
@Dimaris-nsk force-pushed the fix/broker-staleness-rotation branch from 7e0123f to 856fbd8 on April 24, 2026 02:13
@Dimaris-nsk
Author

Good catch @chatgpt-codex-connector — valid P1, the script-name substring alone was too broad and would cross-kill a foreign broker after PID recycle.

Tightened in amended commit (856fbd8): verifyBrokerPid() now also requires the session's unique --pid-file and --endpoint paths to appear in the live ps command line.

if (!cmd.includes("app-server-broker.mjs")) return false;
if (!cmd.includes(`--pid-file ${session.pidFile}`)) return false;
if (session.endpoint && !cmd.includes(`--endpoint ${session.endpoint}`)) return false;
return true;

Why this is instance-specific:

  • spawnBrokerProcess spawns with serve --endpoint <endpoint> --cwd <cwd> --pid-file <pidFile> (broker-lifecycle.mjs L61).
  • Both endpoint and pidFile live inside sessionDir = createBrokerSessionDir(), which is an os.tmpdir() mkdtempSync — unique per spawn.
  • A recycled PID belonging to a foreign broker (different workspace, different $CODEX_HOME) will therefore carry a different --pid-file / --endpoint arg value, so at least one of the two matches fails → verifyBrokerPid returns false → rotation falls through to killProcess: null (trunk behavior: socket/pidfile cleanup only, no SIGTERM).

The session.endpoint check is gated so that legacy sessions written before this PR (no endpoint field saved) still validate on the pid-file argument alone instead of always failing verification. Any mismatch still falls back to the same safe default.
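A slightly stricter variant of the matching above, offered as a sketch rather than as what the commit does: require each flag and value to be delimited by whitespace (or the string boundaries), so a path that is only a prefix of another path does not match. `hasFlagValue` and `commandLineMatchesSession` are hypothetical names, and the sample paths in the usage below are invented for illustration.

```javascript
// Sketch only: helper names are hypothetical; matching rules are a suggested tightening.
function hasFlagValue(cmd, flag, value) {
  const needle = `${flag} ${value}`;
  let idx = cmd.indexOf(needle);
  while (idx !== -1) {
    const before = idx === 0 ? " " : cmd[idx - 1];
    const after = cmd[idx + needle.length] ?? " "; // undefined at end of string
    if (/\s/.test(before) && /\s/.test(after)) return true;
    idx = cmd.indexOf(needle, idx + 1);
  }
  return false;
}

function commandLineMatchesSession(cmd, session) {
  if (!cmd.includes("app-server-broker.mjs")) return false;
  if (!hasFlagValue(cmd, "--pid-file", session.pidFile)) return false;
  // Endpoint is only checked when the session recorded one (legacy sessions may not).
  if (session.endpoint && !hasFlagValue(cmd, "--endpoint", session.endpoint)) return false;
  return true;
}

// Invented sample command line and session, for illustration only.
const cmd = "node /opt/plugin/app-server-broker.mjs serve --endpoint unix:/tmp/cxc-AAAAAA/broker.sock --cwd /work --pid-file /tmp/cxc-AAAAAA/broker.pid";
const own = { pidFile: "/tmp/cxc-AAAAAA/broker.pid", endpoint: "unix:/tmp/cxc-AAAAAA/broker.sock" };
const foreign = { pidFile: "/tmp/cxc-BBBBBB/broker.pid", endpoint: "unix:/tmp/cxc-BBBBBB/broker.sock" };
console.log(commandLineMatchesSession(cmd, own));     // true
console.log(commandLineMatchesSession(cmd, foreign)); // false
```

Like the substring version, this still depends on how `ps` renders arguments, so paths containing unusual whitespace remain a caveat.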

If maintainers prefer something stronger (e.g. a broker-written identity token file separate from ps scraping), happy to switch — let me know.
