async experiment #1

Draft
alex-remedios-aisi wants to merge 17 commits into main from
alex-remedios/2026-04-21-misc

Conversation

@alex-remedios-aisi
Collaborator

No description provided.

alex-remedios-aisi and others added 17 commits April 21, 2026 13:53
- All harness markers (Queued / Running / Done / Error / Skipped /
  timeout) now carry an `[nb mcp]` prefix and render as stderr-stream
  so they're visually distinct from the cell's own stdout.
- Running banner stays pinned at the top of cell outputs throughout
  streaming, so the agent/user can tell a cell is live even when
  its output cadence is slow.
- Queued, Running, and Done/Error footers include a wall-clock
  timestamp with timezone (e.g. `10:18:27 UTC`) — unambiguous across
  remote servers — plus duration on completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Notebooks written outside the MCP (hand-edited, older nbformat) can have
cells without an `id` field. `_flush_outputs_to_disk` is keyed on cell id
and returned silently when it couldn't find the cell, so the Queued
marker (written by index) would appear and then nothing else — no
Running banner, no kernel output, no Done footer — even though the
kernel was happily running.

exec_cell_to_disk now assigns a uuid-based id and persists it before the
first flush. As a defense, _flush_outputs_to_disk now warns to stderr
rather than silently dropping writes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
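The id backfill above can be sketched as follows — a simplified illustration over plain dict cells, assuming a hypothetical helper name `ensure_cell_ids` (the real code assigns ids inside exec_cell_to_disk and persists before the first flush):

```python
import uuid


def ensure_cell_ids(cells: list[dict]) -> list[dict]:
    """Backfill a missing `id` field on notebook cells (illustrative sketch).

    Hand-edited or older-nbformat notebooks can lack cell ids; any flush
    path keyed on cell id would otherwise silently drop output writes.
    """
    for cell in cells:
        if not cell.get("id"):
            cell["id"] = uuid.uuid4().hex[:8]
    return cells
```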
Writes to ./.nb_mcp.log in the MCP server's CWD. Covers:
- server start
- job submit + per-cell running/done/error
- kernel start/ready/stop + interrupt requests
- dropped-output warnings (replaces the stderr print in
  _flush_outputs_to_disk)
- unhandled exceptions in the worker thread (with traceback)

Level via NB_MCP_LOG_LEVEL, path via NB_MCP_LOG_PATH. Falls back to
stderr if the log file can't be opened.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`execute_code` was passing its timeout straight into
`client.get_iopub_msg(timeout=...)` — that's the idle gap between
messages, not the total runtime. A chatty cell that prints every
second kept resetting the window, so a 10-minute budget could run
forever. In the field this showed up as `timeout=600` still executing
at 14 minutes.

The timeout is now a hard deadline anchored at the start. We poll
iopub in 1s slices and re-check the deadline each time. When it fires
we call an `interrupt_fn` (wired to `kernels.interrupt`) so the
kernel actually stops — otherwise the busy kernel would block every
subsequent cell in the job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
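The hard-deadline loop described above can be sketched like this — a minimal stand-in where `get_iopub_msg` and `interrupt_fn` are injected callables, not the project's actual signatures:

```python
import queue
import time


def execute_with_deadline(get_iopub_msg, interrupt_fn, timeout: float):
    """Poll iopub against a hard deadline (illustrative sketch).

    The deadline is anchored at the start, so a chatty cell that prints
    every second can no longer keep resetting the window. We poll in 1s
    slices and re-check the deadline each time; on expiry we interrupt
    the kernel so it doesn't block every subsequent cell in the job.
    """
    deadline = time.monotonic() + timeout
    msgs = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            interrupt_fn()  # wired to kernels.interrupt in the real code
            raise TimeoutError(f"cell exceeded {timeout}s hard deadline")
        try:
            msg = get_iopub_msg(timeout=min(1.0, remaining))
        except queue.Empty:
            continue  # idle gap; only the overall deadline matters
        msgs.append(msg)
        if (msg.get("msg_type") == "status"
                and msg["content"]["execution_state"] == "idle"):
            return msgs
```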
Under heavy IOPub traffic (inspect_ai progress bars, per-step training
logs, display_data every ~20s) a single bad ZMQ frame desynchronises
the client's message parser — every subsequent read raises
`ValueError("'<IDS|MSG>' is not in list")`. Before this change, that
exception would bubble up out of the job worker thread, crashing the
in-flight exec and leaving the agent with no way to reattach. The
kernel itself is fine — GPU, training, file writes all still alive.

Fix:
- exec_runner.execute_code now catches unexpected iopub exceptions,
  logs them, and invokes a recover_fn up to 3 times. recover_fn
  rebuilds the client's ZMQ channels against the same KernelManager
  (kernels.reset_client) — kernel process untouched. We re-subscribe,
  keep filtering by the original msg_id, and continue.
- If recovery fails or the cap is hit, we append a clear
  `[nb mcp] iopub desync` marker telling the agent to use
  exec_status / read_cell and return cleanly instead of crashing.
- Wired recover_fn through exec_cell_to_disk, jobs, and run_scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
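The retry shape described above might look like this — a sketch where `get_msg` and `recover_fn` are injected stand-ins for the real iopub read and `kernels.reset_client`:

```python
def read_iopub_with_recovery(get_msg, recover_fn, max_recoveries: int = 3):
    """Read one iopub message, recovering from parser desync (sketch).

    On an unexpected exception (e.g. the '<IDS|MSG>' ValueError), invoke
    recover_fn — which rebuilds the client's ZMQ channels against the
    same KernelManager, leaving the kernel process untouched — and retry,
    up to max_recoveries times. Past the cap, fail cleanly instead of
    crashing the worker thread.
    """
    recoveries = 0
    while True:
        try:
            return get_msg()
        except ValueError:  # desynchronised message parser
            if recoveries >= max_recoveries:
                raise RuntimeError(
                    "[nb mcp] iopub desync — use exec_status / read_cell")
            recoveries += 1
            recover_fn()  # kernels.reset_client in the real code
```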
The blocking `wait` MCP tool held up the whole tool-call slot for up
to its timeout, blocking the agent from interacting with the user or
doing any other work while a long cell ran. Claude Code's Monitor
tool streams stdout lines as event notifications while the agent
keeps working, which is the better UX for long-running cells.

Changes:
- New subcommand `nb watch --job <id> [--path <nb>]`. Tails
  `.nb_mcp.log` (or `NB_MCP_LOG_PATH`), filters to the target job,
  emits one formatted line per interesting event (submit, cell
  start/done/error, kernel lifecycle, final complete/error), exits
  when the job ends. Line-buffered stdout.
- Non-scratch exec tools now append a ready-to-use Monitor hint to
  their response:
    Monitor(command='uv run nb watch --job abc123 --path nb.ipynb')
- `wait` MCP tool removed, along with `jobs.wait_for_job` helper.
- New test `test_watch_cli.py` covers happy-path event filtering
  and the startup-timeout path.
- CLAUDE.md updated with the new tool + CLI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
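The per-job filtering that `nb watch` does over the log can be sketched as a generator — the line format here is an assumption, and `filter_job_events` is a hypothetical name:

```python
def filter_job_events(lines, job_id: str):
    """Yield only the log lines for one job (illustrative sketch).

    Tails a stream of .nb_mcp.log lines, keeps lines mentioning the
    target job, and stops after the final complete/error event so the
    watch process can exit when the job ends.
    """
    needle = f"job {job_id}"
    for line in lines:
        if needle not in line:
            continue  # other jobs, kernel noise, etc.
        yield line.rstrip("\n")
        if "complete" in line or "error" in line:
            return
```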
Routing every cell through a Monitor hand-off was overkill for quick work.
Non-scratch exec tools now wait up to `block_for` seconds (default 10)
for the background job to finish:
- completes in time → return the full status inline, no Monitor needed
- still running → return the Monitor-ready command as before

block_for=0 is fire-and-forget. All four exec tools take the new
parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
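The two-outcome wait can be sketched as follows — `poll_status` is a hypothetical stand-in for the job-status accessor, not the project's real API:

```python
import time


def maybe_block(poll_status, job_id: str, block_for: float = 10.0):
    """Wait briefly for a background job, else hand off to Monitor (sketch).

    If the job completes within block_for seconds, return its status
    inline so no Monitor round-trip is needed; otherwise return the
    Monitor-ready watch command. block_for=0 is fire-and-forget.
    """
    deadline = time.monotonic() + block_for
    while time.monotonic() < deadline:
        status = poll_status(job_id)
        if status["state"] in ("done", "error"):
            return status  # finished in time: inline result
        time.sleep(0.1)
    return {"state": "running",
            "monitor": f"uv run nb watch --job {job_id}"}
```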
Previously the log file only had lifecycle events (job submit, cell
running, done/error). A chatty cell printing for ten minutes would
produce zero notifications between "running" and "done" — Monitor users
couldn't see the job was healthy without also `read_cell`-ing the
notebook.

Now each cell has its own rate-limited progress emitter. When the cell
produces new stream output, we wait at least
NB_MCP_PROGRESS_INTERVAL_SEC (default 1.0s) before emitting one INFO
line: `job X cell [N] out: <last line>` (200 char truncation). Matches
the nb watch filter by job id, so Monitor delivers each line as an
event. Set the env var to 0 to disable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
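The rate-limited emitter could look roughly like this — an illustrative class, with names and the exact throttle placement assumed:

```python
import time


class ProgressEmitter:
    """Per-cell, rate-limited progress logger (illustrative sketch).

    When new stream output arrives, emit at most one INFO line per
    interval: `job X cell [N] out: <last line>`, truncated to 200 chars.
    An interval of 0 disables emission entirely.
    """

    def __init__(self, log_fn, interval: float = 1.0):
        self.log_fn = log_fn
        self.interval = interval
        self.last_emit = time.monotonic()  # wait a full interval first

    def on_output(self, job_id: str, cell_idx: int, text: str) -> None:
        if self.interval <= 0 or not text.strip():
            return
        now = time.monotonic()
        if now - self.last_emit < self.interval:
            return  # throttled: at least one interval between lines
        self.last_emit = now
        last_line = text.rstrip("\n").splitlines()[-1][:200]
        self.log_fn(f"job {job_id} cell [{cell_idx}] out: {last_line}")
```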
One-per-second was too chatty for realistic workloads (training loops,
evals). Ten seconds is a saner default — still plenty of mid-run
notifications but 10x less noise in the log file and in Monitor.
Override via NB_MCP_PROGRESS_INTERVAL_SEC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: a cell with zero stream output (time.sleep, GPU compute,
blocking I/O) produced no log lines between the initial "cell
running" and final "cell done" events. Monitor stayed silent for the
whole run — indistinguishable from a hung kernel from the agent's
perspective.

Now: a per-cell heartbeat thread logs `job X cell [N] still running
(Ns elapsed)` at the progress interval, but only when the cell is
genuinely silent. The heartbeat skips if:
- the progress emitter logged within the interval (chatty cell), or
- the kernel produced output within the interval (output arrived
  but was throttled from being logged)

So a chatty cell's real output always wins; the heartbeat only
surfaces for actually-silent work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
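The skip conditions above can be sketched as a pure decision function — argument names and the exact line format are assumptions; the real code runs this inside the heartbeat thread:

```python
import time


def heartbeat_line(job_id, cell_idx, started, last_progress_log,
                   last_output, interval, now=None):
    """Decide whether to emit a heartbeat line (illustrative sketch).

    Returns the "still running" line only when the cell is genuinely
    silent: no progress line emitted within the interval, and no kernel
    output within the interval (output that arrived but was throttled
    from being logged also suppresses the heartbeat).
    """
    now = time.monotonic() if now is None else now
    if now - last_progress_log < interval:
        return None  # chatty cell: its real output already surfaced
    if now - last_output < interval:
        return None  # output arrived but was throttled from logging
    return (f"job {job_id} cell [{cell_idx}] still running "
            f"({now - started:.0f}s elapsed)")
```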
Per-notebook exec_status is fine when you already know which file you
care about. But when debugging "is anything running?" or "do I have
stale kernels?", the agent had no way to get a global view without
checking each notebook.

New MCP tool: `status()` (no args). Lists every registered kernel
(path, alive, pid) and every active/recent job. Backed by two new
accessors — kernels.list_all and jobs.list_all_active /
list_all_finished — which read the in-memory state of the running
MCP server.

New CLI: `nb status`. Useful when the MCP is down or when debugging
outside Claude Code. Reads .nb_mcp.log to reconstruct job history and
shells out to pgrep for live ipykernel pids. Not as accurate as the
MCP tool (log-based, not in-memory) but always available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
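The log-based reconstruction behind `nb status` could be sketched like this — the log-line format here is a guess, and `jobs_from_log` is a hypothetical name:

```python
import re


def jobs_from_log(lines):
    """Replay .nb_mcp.log and keep the latest lifecycle event per job.

    Illustrative sketch: log-based, so less accurate than the in-memory
    MCP status() tool, but always available even when the MCP is down.
    """
    jobs: dict[str, str] = {}
    for line in lines:
        m = re.search(r"job (\w+) (submitted|running|complete|error)", line)
        if m:
            jobs[m.group(1)] = m.group(2)  # later events overwrite earlier
    return jobs
```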
Journal 15 covers the whole session: the pivot from blocking wait to
Monitor-driven `nb watch`, the marker/log/progress/heartbeat work
that makes tailing worthwhile, and the hardening fixes that fell
out (wall-clock timeout, iopub recovery, cell-id backfill).

Journal 16 frames the open question of reattaching to kernels across MCP
restarts — what's required, what needs deciding, and a sketch of
the shape without committing to a direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Disk-quota exhaustion and other write-path errors can cause cells to
fail without the agent seeing a clear error — propagation is there
but the logging is weak. Captured in journal 16 as a related
hardening pass rather than as a separate entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a training run's kernel gets OOM-killed mid-execution, the
client's iopub reads fail. The existing recovery code tried to
reset channels, which also fails (no kernel to talk to), and
emitted a generic "iopub desync — channel recovery failed" marker.
The job then continued optimistically against the dead kernel,
with subsequent cells producing more opaque failures.

Now:
- kernels.reset_client raises a new KernelDeadError when the
  underlying km.is_alive() is False, or when wait_for_ready times
  out after the rebuild.
- exec_runner.execute_code catches it distinctly. Emits
  log.error "kernel died during cell execution: …" and an
  nbformat error output (ename=NbMcpKernelDied) on the cell, so
  the job is marked ERROR and subsequent cells are skipped.
- Inline cell marker now says "kernel died mid-execution" with
  operator guidance (run status(), expect fresh kernel next exec)
  instead of the misleading "iopub desync".

Agents monitoring the log via `nb watch` now see a clear root
cause instead of a vague channel message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
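The distinct kernel-death path can be sketched as follows — `KernelDeadError` and `NbMcpKernelDied` come from the commit message, but the surrounding functions are simplified stand-ins where `km` mimics a jupyter_client KernelManager:

```python
class KernelDeadError(RuntimeError):
    """Raised when the kernel process is gone (name from the commit)."""


def reset_client(km):
    """Rebuild client channels, but refuse if the kernel is dead (sketch)."""
    if not km.is_alive():
        raise KernelDeadError("kernel process is not alive")
    return km  # real code rebuilds the ZMQ channels here


def run_cell(km, execute):
    """Catch kernel death distinctly during recovery (illustrative sketch).

    A dead kernel yields an nbformat-style error output so the job is
    marked ERROR and subsequent cells are skipped, instead of the job
    continuing optimistically against a kernel that no longer exists.
    """
    try:
        return execute()
    except ValueError:  # iopub read failed; try channel recovery
        try:
            reset_client(km)
        except KernelDeadError:
            return {"output_type": "error", "ename": "NbMcpKernelDied",
                    "evalue": "kernel died mid-execution; run status(), "
                              "expect a fresh kernel on the next exec",
                    "traceback": []}
        raise  # channels recovered but the original error stands
```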
