Skip to content

[Bugfix #905] Replay EXIT frame to clients that connect after PTY exit#953

Open
mohidmakhdoomi wants to merge 11 commits into
cluesmith:mainfrom
mohidmakhdoomi:builder/bugfix-905
Open

[Bugfix #905] Replay EXIT frame to clients that connect after PTY exit#953
mohidmakhdoomi wants to merge 11 commits into
cluesmith:mainfrom
mohidmakhdoomi:builder/bugfix-905

Conversation

@mohidmakhdoomi
Copy link
Copy Markdown
Contributor

Fixes #905

Root cause

The 3 session-manager integration tests guarded by it.skipIf(!!process.env.CI) were timing out locally (15s each), not on CI (where they're skipped). Investigation found a genuine race condition in the shellper, not an environmental/test-only flake:

  • ShellperProcess broadcasts the EXIT frame only to clients connected at the moment the PTY exits (pty.onExitbroadcast).
  • handleHello sent WELCOME + REPLAY to a connecting client but never an EXIT frame, even when the PTY had already exited.
  • SessionManager.createSession connects its client only after spawn → read info → waitForSocket → connect. A fast-exiting command (/bin/sh -c 'exit 1', exit 0) finishes before that connect completes, so the EXIT broadcast reaches nobody. The client then hangs forever waiting for an exit event → session-exit / session-error never fires → 15s test timeout.
  • WSL / slow disk widens the spawn→connect window, which is why it reproduced locally. On faster machines the client usually wins the race, so the tests "passed" (when not skipped) — i.e. they were latently flaky.

The original skipIf(CI) (Spec 0104) was added for an unrelated reason: node-pty native module resolution fails in child node processes on GitHub Actions. That rationale is independent of this race and still valid, so the CI skip is left in place.

Fix

packages/codev/src/terminal/shellper-process.ts:

  • Retain the PTY's exit info (exitInfo) when it exits; reset it on every (re)spawn.
  • In handleHello, after WELCOME + REPLAY, if the PTY has already exited, replay the EXIT frame to the late-connecting client so it doesn't hang. No double-send: the broadcast only reaches already-connected clients, the replay only the late one.

Regression test

shellper-process.test.ts: exit the PTY before any client connects, then connect and assert the client still receives an EXIT frame.

Results

  • The 3 affected tests: 45.98s of timeouts → all pass in <2s (and now actually run locally instead of hanging).
  • Full pnpm build + full unit suite: 152 files / 3211 tests passed, 13 skipped, 0 failed.

Test plan

  • Reproduced the 15s timeout locally with dist/ built
  • Verified node-pty itself works locally (ruled out as cause)
  • Confirmed fix makes the 3 tests pass and added a focused regression test
  • Full build + full unit suite green

CMAP Review

🤖 Generated with Claude Code

mohidmakhdoomi and others added 5 commits May 31, 2026 16:26
… after PTY exit

Fast-exiting commands (e.g. `exit 1`) could finish before SessionManager
connected its client. The shellper broadcast the EXIT frame only to clients
connected at exit time, so the late-connecting client never learned the
session ended and hung until the 15s test timeout. WSL/slow disk widened the
race, which is why the session-manager integration tests timed out locally
while passing (when not skipped) elsewhere.

Retain exit info after the PTY exits and replay an EXIT frame to any client
that connects post-exit (during the HELLO/WELCOME handshake). Adds a
regression test that exits the PTY before connecting and asserts the client
still receives an EXIT frame.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mohidmakhdoomi
Copy link
Copy Markdown
Contributor Author

CMAP Review (3-way)

Model Verdict Confidence Key issues
Gemini ✅ APPROVE HIGH None
Codex ✅ APPROVE MEDIUM None
Claude ⚠️ n/a Lane hit usage limit (rate-limited, not a code finding)

Gemini: "The fix elegantly resolves the race condition by retaining and replaying the PTY exit frame for late-connecting clients, supported by a solid and focused regression test."

Codex: "Focused fix for the late-connecting EXIT race, with an appropriate regression test and solid BUGFIX PR hygiene." (Codex could not independently re-run Vitest — its sandbox is read-only/EROFS; tests were verified locally by the builder: full build + 3211 unit tests green.)

No REQUEST_CHANGES. No code changes required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mohidmakhdoomi
Copy link
Copy Markdown
Contributor Author

Architect Integration Review — CI follow-up (do NOT merge yet)

The code fix is correct and approved on its merits, but CI's Tower Integration Tests job is failing and this is a regression introduced by this PR (main's last 3 runs of that job are green).

Root cause (regression from the EXIT-replay fix)

  • Failing test: src/agent-farm/__tests__/send-integration.e2e.test.tsafterAll hook times out at 10s.
  • afterAll (10s budget) deactivates workspace A (2 builder terminals) then B (1). Each deactivation runs stopInstance, which calls waitForTerminalExit(manager, terminalId) per terminal (tower-instances.ts:781/791/801).
  • Before this PR, late-connecting Tower clients never received EXIT, so sessions stayed status: 'running' — a different bug, but it didn't trip this path the same way.
  • After this PR, EXIT is replayed on connect, so each PtySession fires its 'exit' event early — before teardown attaches session.once('exit'). once only catches future emissions, so the already-fired exit is missed and waitForTerminalExit waits out the full 5s safety timeout. A-then-B sequential ≈ 10s → hook timeout.

Fix (per Codex, validated against the code)

In waitForTerminalExit (packages/codev/src/agent-farm/servers/tower-instances.ts:123), short-circuit when the session has already exited, before attaching the once('exit') listener:

const session = manager.getSession(terminalId);
if (!session || typeof (session as { once?: unknown }).once !== 'function') {
  return Promise.resolve();
}
// The 'exit' event already fired and won't fire again — don't wait out the
// 5s safety timeout (Bugfix #905 made exits propagate before teardown attaches).
if ((session as { status?: string }).status === 'exited') {
  return Promise.resolve();
}

session.status is the real getter at pty-session.ts:386 ('running' | 'exited'), so this is safe.

Asks

  1. Apply the fix above.
  2. Add a small unit test for waitForTerminalExit: an already-exited session resolves immediately (no 5s wait) — guards the regression directly, faster than the e2e.
  3. Re-run the full build + unit tests, and confirm the Tower Integration Tests job goes green in CI.
  4. Report back at the pr gate — do not merge until CI is green and I re-approve.

Architect integration review

mohidmakhdoomi and others added 4 commits May 31, 2026 20:11
…eady-exited sessions

The EXIT-replay fix made shellper exits propagate before teardown attaches
its once('exit') listener. waitForTerminalExit attached the listener after
the event had already fired, so an already-exited session waited out the full
5s safety timeout — A-then-B sequential teardown (3 terminals) overran the
send-integration e2e afterAll 10s budget in CI.

Short-circuit on session.status === 'exited' before attaching the listener.
Exports waitForTerminalExit and adds focused unit tests (already-exited
resolves immediately without attaching the listener; missing session resolves;
running session still resolves on the exit event).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mohidmakhdoomi
Copy link
Copy Markdown
Contributor Author

Architect Integration Review — APPROVED ✅

pr gate approved (human-explicit). Cleared to merge.

Sign-off:

  • CI: all 6 jobs green, including the previously-red Tower Integration Tests (the waitForTerminalExit short-circuit resolved the afterAll teardown timeout).
  • Gemini (integration, on HEAD): APPROVE — HIGH (verified the short-circuit + regression test directly).
  • Codex (integration): APPROVE — HIGH (no integration or security concerns; teardown latency no longer scales with already-exited terminals).
  • Architect: approve. This is a genuine product-bug fix (late-connect EXIT race), not a re-skip — exactly what Investigate flaky session-manager integration tests (formerly skipIf-CI) #905 asked for. The follow-on waitForTerminalExit fix is correct and unit-guarded.

Note: the first Gemini pass returned REQUEST_CHANGES, but it reviewed a stale view and requested the exact fix already present in this PR; the re-run on HEAD confirmed APPROVE.


Architect integration review

…protection blocker)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate flaky session-manager integration tests (formerly skipIf-CI)

1 participant