Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
id: bugfix-905
title: investigate-flaky-session-mana
protocol: bugfix
phase: pr
plan_phases: []
current_plan_phase: null
gates:
pr:
status: approved
requested_at: '2026-06-01T00:02:25.837Z'
approved_at: '2026-06-01T00:25:20.957Z'
iteration: 1
build_complete: false
history: []
started_at: '2026-05-31T20:26:53.743Z'
updated_at: '2026-06-01T00:25:20.958Z'
pr_ready_for_human: false
79 changes: 79 additions & 0 deletions codev/state/bugfix-905_thread.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# bugfix-905: Investigate flaky session-manager integration tests

## Investigate phase

### Findings so far
- 3 affected tests in `packages/codev/src/terminal/__tests__/session-manager.test.ts`:
- `respects maxRestarts limit` (~1020)
- `logs session exit without stderr tail (stderr goes to file)` (~1724)
- `no stderr tail logged for file-based stderr (Bugfix #324)` (~1834)
- All integration tests spawn a real shellper via `dist/terminal/shellper-main.js`
(resolved relative to the test file: `../../../dist/terminal/shellper-main.js`).
- **Original `skipIf(CI)` intent** (commit fd012108e, Spec 0104): skipped in CI because
node-pty native module resolution fails in child processes on GitHub Actions. Expected
to run locally.

### Leading hypothesis
The default `pnpm test` (`vitest`) does NOT build `dist/` first (only `test:e2e`/`test:e2e:cli`
run `pnpm build`). In a fresh worktree `dist/terminal/shellper-main.js` is absent, so spawning
the shellper fails / never emits `session-exit` → 15s test timeout. Root cause = missing build
artifact dependency, not a deadlock in the code under test.

Verifying: reproduce timeout w/o dist, then build dist and re-run.

### ROOT CAUSE (confirmed) — not missing dist
- Without dist: tests fail fast (~120ms) with "Invalid shellper info JSON" (spawn can't find script).
- With dist built: the 3 tests genuinely time out at 15s each (45.98s total). Real bug, env-amplified.
- node-pty itself works fine locally (direct spawn test passes).
- **The bug**: `ShellperProcess` broadcasts the EXIT frame only to clients connected
*at exit time* (`shellper-process.ts` pty.onExit → broadcast). `handleHello` sent
WELCOME + REPLAY but never an EXIT frame. Fast-exiting commands (`exit 1`/`exit 0`)
finish before `SessionManager.createSession` connects its client (spawn → read info →
waitForSocket → connect), so the EXIT broadcast reaches nobody and the client hangs
→ `session-exit`/`session-error` never fires → 15s timeout. WSL/slow disk widens the race.
- Why local-only & flaky: CI skipped these (node-pty child-proc resolution); locally they
were racy and lost the race on slow filesystems.

### FIX (`shellper-process.ts`)
- Retain `exitInfo` when the PTY exits; reset on (re)spawn.
- In `handleHello`, after WELCOME+REPLAY, if already exited, replay the EXIT frame to the
late-connecting client. No double-send (broadcast only hits already-connected clients).
- Regression test in `shellper-process.test.ts`: exit BEFORE connect, assert client still
gets EXIT frame.
- Kept `skipIf(CI)` — node-pty-on-CI rationale is independent of this race and still valid.

### Results
- 3 affected tests: 45.98s timeout → all pass in <2s.
- Full shellper-process + session-manager suites: 114 passed, 0 failed (18.72s).
- After clean `pnpm install && pnpm build`: full unit suite 152 files / 3211 passed, 13 skipped, 0 failed.

## PR phase
- PR #953 → cluesmith/codev (cross-fork from mohidmakhdoomi:builder/bugfix-905).
Note: `origin` has split URLs (fetch=cluesmith, push=mohidmakhdoomi fork); branch lives on the fork.
- CMAP-3 (--issue 905): Gemini APPROVE (HIGH), Codex APPROVE (MEDIUM, couldn't re-run vitest in RO sandbox).
Claude lane hit a usage limit (rate-limited, not a finding). No REQUEST_CHANGES.
- porch: investigate → fix → pr done; PR gate requested.

## CI regression follow-up (architect-flagged)
- Tower Integration Tests job went red: `send-integration.e2e.test.ts` afterAll (10s) timed out.
- Cause: the EXIT-replay fix makes shellper exits propagate *earlier*. `waitForTerminalExit`
(`tower-instances.ts:123`) attached `once('exit')` *after* the event had already fired, so an
already-exited session waited out the full 5s safety timeout × N terminals (A then B ≈ 10s).
- Fix: short-circuit `waitForTerminalExit` when `session.status === 'exited'` before attaching the
listener. Exported the fn + added 3 focused unit tests in `tower-instances.test.ts`.
- Could NOT reproduce the e2e locally: `registerTerminal` returns 500 in this sandbox during
`beforeAll` (real-shellper spawn fails — same node-pty-in-child-process limitation that gates
these CI-tier tests). My change only touches teardown, so it can't cause a beforeAll 500.
- Verified: full `pnpm build` + full unit suite green (152 files / 3214 passed, +3 new, 0 failed).
Pushed for CI to verify the Tower Integration Tests job. **Do NOT merge until CI green + re-approval.**
- **CI RESULT: all 6 checks GREEN** — Tower Integration Tests (the formerly-red job), Unit Tests,
CLI Tests (ubuntu + macos), CLI Integration Tests, Package Install Verification all pass.
Holding at the pr gate awaiting architect re-approval. Not merging.

## Final state: HOLD (architect decision 2026-06-01)
- pr gate human-approved; consult 2-0 APPROVE (Gemini+Codex HIGH); all 6 CI checks green; MERGEABLE.
- Merge BLOCKED by GitHub branch protection: `reviewDecision=REVIEW_REQUIRED`, 0 formal reviews.
Single-account repo can't self-approve, so the required review can't be satisfied right now.
- Architect decision: **leave PR #953 OPEN, do NOT merge, do NOT use --admin.** Stand by at pr gate;
revisit once a reviewer account / branch-protection config is sorted. Work itself is complete & green.

47 changes: 47 additions & 0 deletions packages/codev/src/agent-farm/__tests__/tower-instances.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ import {
killTerminalWithShellper,
stopInstance,
removeArchitect,
waitForTerminalExit,
type InstanceDeps,
} from '../servers/tower-instances.js';

Expand Down Expand Up @@ -126,6 +127,52 @@ describe('tower-instances', () => {
});
});

// =========================================================================
// waitForTerminalExit
// =========================================================================

describe('waitForTerminalExit', () => {
it('resolves immediately when the session has already exited (Bugfix #905 regression)', async () => {
// Bugfix #905 made shellper exits propagate before teardown attaches its
// once('exit') listener. A session that already fired 'exit' must NOT wait
// out the 5s safety timeout — it should short-circuit on status.
const once = vi.fn();
const manager = {
getSession: vi.fn().mockReturnValue({ status: 'exited', once }),
} as unknown as Parameters<typeof waitForTerminalExit>[0];

await expect(waitForTerminalExit(manager, 'term-1', 5000)).resolves.toBeUndefined();
// Must not even attach the listener for an already-exited session.
expect(once).not.toHaveBeenCalled();
});

it('resolves immediately when the session is missing', async () => {
const manager = {
getSession: vi.fn().mockReturnValue(undefined),
} as unknown as Parameters<typeof waitForTerminalExit>[0];

await expect(waitForTerminalExit(manager, 'gone', 5000)).resolves.toBeUndefined();
});

it('waits for the exit event on a still-running session', async () => {
let exitHandler: (() => void) | null = null;
const manager = {
getSession: vi.fn().mockReturnValue({
status: 'running',
once: (event: string, cb: () => void) => {
if (event === 'exit') exitHandler = cb;
},
}),
} as unknown as Parameters<typeof waitForTerminalExit>[0];

const promise = waitForTerminalExit(manager, 'term-2', 5000);
expect(exitHandler).toBeTypeOf('function');
// Fire the event the listener is waiting for.
exitHandler!();
await expect(promise).resolves.toBeUndefined();
});
});

// =========================================================================
// registerKnownWorkspace
// =========================================================================
Expand Down
10 changes: 9 additions & 1 deletion packages/codev/src/agent-farm/servers/tower-instances.ts
Original file line number Diff line number Diff line change
Expand Up @@ -120,14 +120,22 @@ export function isIntentionallyStopping(workspacePath: string): boolean {
* because the session was already gone), the promise still resolves so we
* don't block the stop indefinitely.
*/
function waitForTerminalExit(manager: TerminalManager, terminalId: string, timeoutMs = 5000): Promise<void> {
export function waitForTerminalExit(manager: TerminalManager, terminalId: string, timeoutMs = 5000): Promise<void> {
const session = manager.getSession(terminalId);
// Defensive: if the session is gone or doesn't look like an EventEmitter
// (e.g. a test stub), there's nothing to wait for — resolve immediately so
// callers aren't blocked.
if (!session || typeof (session as { once?: unknown }).once !== 'function') {
return Promise.resolve();
}
// The session may have already exited before we attach the listener below.
// `once('exit')` only catches *future* emissions, so a session that already
// fired 'exit' would otherwise wait out the full safety timeout. Bugfix #905
// made shellper exits propagate earlier (EXIT is now replayed to clients that
// connect after the PTY exits), which surfaced this: short-circuit instead.
if ((session as { status?: string }).status === 'exited') {
return Promise.resolve();
}
return new Promise<void>((resolve) => {
let settled = false;
const finish = () => {
Expand Down
33 changes: 33 additions & 0 deletions packages/codev/src/terminal/__tests__/shellper-process.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -387,6 +387,39 @@ describe('ShellperProcess', () => {
mockPty.simulateExit(1);
expect(shellper.hasExited).toBe(true);
});

it('replays EXIT frame to a client that connects after the PTY already exited (Bugfix #905)', async () => {
shellper = new ShellperProcess(createMockPty, socketPath);
await shellper.start('/bin/bash', [], '/tmp', {}, 80, 24);

// PTY exits BEFORE any client connects — the original EXIT broadcast
// reaches nobody. A late-connecting client must still be told.
mockPty.simulateExit(1);
expect(shellper.hasExited).toBe(true);

const socket = net.createConnection(socketPath);
const parser = createFrameParser();
socket.pipe(parser);

const framesPromise = collectFramesFor(parser, 300);
await new Promise<void>((resolve, reject) => {
socket.on('error', reject);
socket.on('connect', () => {
socket.write(encodeHello({ version: PROTOCOL_VERSION, clientType: 'tower' }));
resolve();
});
});

const frames = await framesPromise;
const exitFrame = frames.find((f) => f.type === FrameType.EXIT);
expect(exitFrame).toBeDefined();
if (exitFrame) {
const msg = parseJsonPayload<ExitMessage>(exitFrame.payload);
expect(msg.code).toBe(1);
}

socket.destroy();
});
});

describe('SPAWN handling', () => {
Expand Down
20 changes: 18 additions & 2 deletions packages/codev/src/terminal/shellper-process.ts
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,12 @@ export class ShellperProcess extends EventEmitter {
// shellper that has emitted nothing yet still reports a sane value.
private lastDataAt: number = Date.now();
private exited = false;
// Exit info retained after the PTY exits so clients that connect *after*
// exit still learn the session ended. Without this, a fast-exiting command
// (e.g. `exit 1`) can finish before the manager connects its client, the
// EXIT broadcast reaches nobody, and the client hangs forever waiting for an
// EXIT frame that already went out (Bugfix #905).
private exitInfo: { code: number; signal: string | null } | null = null;

constructor(
private readonly ptyFactory: () => IShellperPty,
Expand Down Expand Up @@ -125,6 +131,7 @@ export class ShellperProcess extends EventEmitter {
rows: number,
): void {
this.exited = false;
this.exitInfo = null;
const pty = this.ptyFactory();
this.pty = pty;
pty.spawn(command, args, {
Expand Down Expand Up @@ -155,10 +162,11 @@ export class ShellperProcess extends EventEmitter {
this.log(`PTY exited: code=${exitInfo.exitCode}, signal=${exitInfo.signal ?? null}`);

this.exited = true;
const exitFrame = encodeExit({
this.exitInfo = {
code: exitInfo.exitCode,
signal: exitInfo.signal != null ? String(exitInfo.signal) : null,
});
};
const exitFrame = encodeExit(this.exitInfo);

this.broadcast(exitFrame);

Expand Down Expand Up @@ -375,6 +383,14 @@ export class ShellperProcess extends EventEmitter {
socket.write(encodeReplay(replayData));
}

// If the PTY already exited before this client connected, the original
// EXIT broadcast missed it. Replay the retained EXIT frame so the client
// doesn't hang waiting for an event that already fired (Bugfix #905).
if (this.exited && this.exitInfo) {
socket.write(encodeExit(this.exitInfo));
this.log(`EXIT replayed to late connection ${connectionId}: code=${this.exitInfo.code}`);
}

return connectionId;
}

Expand Down
Loading