async experiment #1

Draft
alex-remedios-aisi wants to merge 17 commits into main from
alex-remedios/2026-04-21-misc

Conversation

@alex-remedios-aisi
Collaborator

No description provided.

alex-remedios-aisi and others added 17 commits April 21, 2026 13:53
- All harness markers (Queued / Running / Done / Error / Skipped /
  timeout) now carry an `[nb mcp]` prefix and render as stderr-stream
  so they're visually distinct from the cell's own stdout.
- Running banner stays pinned at the top of cell outputs throughout
  streaming, so the agent/user can tell a cell is live even when
  its output cadence is slow.
- Queued, Running, and Done/Error footers include a wall-clock
  timestamp with timezone (e.g. `10:18:27 UTC`) — unambiguous across
  remote servers — plus duration on completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Notebooks written outside the MCP (hand-edited, older nbformat) can have
cells without an `id` field. `_flush_outputs_to_disk` is keyed on cell id
and returned silently when it couldn't find the cell, so the Queued
marker (written by index) would appear and then nothing else — no
Running banner, no kernel output, no Done footer — even though the
kernel was happily running.

exec_cell_to_disk now assigns a uuid-based id and persists it before the
first flush. As a defense, _flush_outputs_to_disk now warns to stderr
rather than silently dropping writes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
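The id backfill above can be sketched as follows — a simplified illustration over plain dict cells, assuming a hypothetical helper name `ensure_cell_ids` (the real code assigns ids inside exec_cell_to_disk and persists before the first flush):

```python
import uuid


def ensure_cell_ids(cells: list[dict]) -> list[dict]:
    """Backfill a missing `id` field on notebook cells (illustrative sketch).

    Hand-edited or older-nbformat notebooks can lack cell ids; any flush
    path keyed on cell id would otherwise silently drop output writes.
    """
    for cell in cells:
        if not cell.get("id"):
            cell["id"] = uuid.uuid4().hex[:8]
    return cells
```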
Writes to ./.nb_mcp.log in the MCP server's CWD. Covers:
- server start
- job submit + per-cell running/done/error
- kernel start/ready/stop + interrupt requests
- dropped-output warnings (replaces the stderr print in
  _flush_outputs_to_disk)
- unhandled exceptions in the worker thread (with traceback)

Level via NB_MCP_LOG_LEVEL, path via NB_MCP_LOG_PATH. Falls back to
stderr if the log file can't be opened.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`execute_code` was passing its timeout straight into
`client.get_iopub_msg(timeout=...)` — that's the idle gap between
messages, not the total runtime. A chatty cell that prints every
second kept resetting the window, so a 10-minute budget could run
forever. In the field this showed up as `timeout=600` still executing
at 14 minutes.

The timeout is now a hard deadline anchored at the start. We poll
iopub in 1s slices and re-check the deadline each time. When it fires
we call an `interrupt_fn` (wired to `kernels.interrupt`) so the
kernel actually stops — otherwise the busy kernel would block every
subsequent cell in the job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
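The hard-deadline loop described above can be sketched like this — a minimal stand-in where `get_iopub_msg` and `interrupt_fn` are injected callables, not the project's actual signatures:

```python
import queue
import time


def execute_with_deadline(get_iopub_msg, interrupt_fn, timeout: float):
    """Poll iopub against a hard deadline (illustrative sketch).

    The deadline is anchored at the start, so a chatty cell that prints
    every second can no longer keep resetting the window. We poll in 1s
    slices and re-check the deadline each time; on expiry we interrupt
    the kernel so it doesn't block every subsequent cell in the job.
    """
    deadline = time.monotonic() + timeout
    msgs = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            interrupt_fn()  # wired to kernels.interrupt in the real code
            raise TimeoutError(f"cell exceeded {timeout}s hard deadline")
        try:
            msg = get_iopub_msg(timeout=min(1.0, remaining))
        except queue.Empty:
            continue  # idle gap; only the overall deadline matters
        msgs.append(msg)
        if (msg.get("msg_type") == "status"
                and msg["content"]["execution_state"] == "idle"):
            return msgs
```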
Under heavy IOPub traffic (inspect_ai progress bars, per-step training
logs, display_data every ~20s) a single bad ZMQ frame desynchronises
the client's message parser — every subsequent read raises
`ValueError("'<IDS|MSG>' is not in list")`. Before this change, that
exception would bubble up out of the job worker thread, crashing the
in-flight exec and leaving the agent with no way to reattach. The
kernel itself is fine — GPU, training, file writes all still alive.

Fix:
- exec_runner.execute_code now catches unexpected iopub exceptions,
  logs them, and invokes a recover_fn up to 3 times. recover_fn
  rebuilds the client's ZMQ channels against the same KernelManager
  (kernels.reset_client) — kernel process untouched. We re-subscribe,
  keep filtering by the original msg_id, and continue.
- If recovery fails or the cap is hit, we append a clear
  `[nb mcp] iopub desync` marker telling the agent to use
  exec_status / read_cell and return cleanly instead of crashing.
- Wired recover_fn through exec_cell_to_disk, jobs, and run_scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
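The retry shape described above might look like this — a sketch where `get_msg` and `recover_fn` are injected stand-ins for the real iopub read and `kernels.reset_client`:

```python
def read_iopub_with_recovery(get_msg, recover_fn, max_recoveries: int = 3):
    """Read one iopub message, recovering from parser desync (sketch).

    On an unexpected exception (e.g. the '<IDS|MSG>' ValueError), invoke
    recover_fn — which rebuilds the client's ZMQ channels against the
    same KernelManager, leaving the kernel process untouched — and retry,
    up to max_recoveries times. Past the cap, fail cleanly instead of
    crashing the worker thread.
    """
    recoveries = 0
    while True:
        try:
            return get_msg()
        except ValueError:  # desynchronised message parser
            if recoveries >= max_recoveries:
                raise RuntimeError(
                    "[nb mcp] iopub desync — use exec_status / read_cell")
            recoveries += 1
            recover_fn()  # kernels.reset_client in the real code
```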
The blocking `wait` MCP tool held up the whole tool-call slot for up
to its timeout, blocking the agent from interacting with the user or
doing any other work while a long cell ran. Claude Code's Monitor
tool streams stdout lines as event notifications while the agent
keeps working, which is the better UX for long-running cells.

Changes:
- New subcommand `nb watch --job <id> [--path <nb>]`. Tails
  `.nb_mcp.log` (or `NB_MCP_LOG_PATH`), filters to the target job,
  emits one formatted line per interesting event (submit, cell
  start/done/error, kernel lifecycle, final complete/error), exits
  when the job ends. Line-buffered stdout.
- Non-scratch exec tools now append a ready-to-use Monitor hint to
  their response:
    Monitor(command='uv run nb watch --job abc123 --path nb.ipynb')
- `wait` MCP tool removed, along with `jobs.wait_for_job` helper.
- New test `test_watch_cli.py` covers happy-path event filtering
  and the startup-timeout path.
- CLAUDE.md updated with the new tool + CLI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
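The per-job filtering that `nb watch` does over the log can be sketched as a generator — the line format here is an assumption, and `filter_job_events` is a hypothetical name:

```python
def filter_job_events(lines, job_id: str):
    """Yield only the log lines for one job (illustrative sketch).

    Tails a stream of .nb_mcp.log lines, keeps lines mentioning the
    target job, and stops after the final complete/error event so the
    watch process can exit when the job ends.
    """
    needle = f"job {job_id}"
    for line in lines:
        if needle not in line:
            continue  # other jobs, kernel noise, etc.
        yield line.rstrip("\n")
        if "complete" in line or "error" in line:
            return
```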
Routing every cell through a Monitor hand-off was overkill for quick work.
Non-scratch exec tools now wait up to `block_for` seconds (default 10)
for the background job to finish:
- completes in time → return the full status inline, no Monitor needed
- still running → return the Monitor-ready command as before

block_for=0 is fire-and-forget. All four exec tools take the new
parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
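The two-outcome wait can be sketched as follows — `poll_status` is a hypothetical stand-in for the job-status accessor, not the project's real API:

```python
import time


def maybe_block(poll_status, job_id: str, block_for: float = 10.0):
    """Wait briefly for a background job, else hand off to Monitor (sketch).

    If the job completes within block_for seconds, return its status
    inline so no Monitor round-trip is needed; otherwise return the
    Monitor-ready watch command. block_for=0 is fire-and-forget.
    """
    deadline = time.monotonic() + block_for
    while time.monotonic() < deadline:
        status = poll_status(job_id)
        if status["state"] in ("done", "error"):
            return status  # finished in time: inline result
        time.sleep(0.1)
    return {"state": "running",
            "monitor": f"uv run nb watch --job {job_id}"}
```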
Previously the log file only had lifecycle events (job submit, cell
running, done/error). A chatty cell printing for ten minutes would
produce zero notifications between "running" and "done" — Monitor users
couldn't see the job was healthy without also `read_cell`-ing the
notebook.

Now each cell has its own rate-limited progress emitter. When the cell
produces new stream output, we wait at least
NB_MCP_PROGRESS_INTERVAL_SEC (default 1.0s) before emitting one INFO
line: `job X cell [N] out: <last line>` (200 char truncation). Matches
the nb watch filter by job id, so Monitor delivers each line as an
event. Set the env var to 0 to disable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
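The rate-limited emitter could look roughly like this — an illustrative class, with names and the exact throttle placement assumed:

```python
import time


class ProgressEmitter:
    """Per-cell, rate-limited progress logger (illustrative sketch).

    When new stream output arrives, emit at most one INFO line per
    interval: `job X cell [N] out: <last line>`, truncated to 200 chars.
    An interval of 0 disables emission entirely.
    """

    def __init__(self, log_fn, interval: float = 1.0):
        self.log_fn = log_fn
        self.interval = interval
        self.last_emit = time.monotonic()  # wait a full interval first

    def on_output(self, job_id: str, cell_idx: int, text: str) -> None:
        if self.interval <= 0 or not text.strip():
            return
        now = time.monotonic()
        if now - self.last_emit < self.interval:
            return  # throttled: at least one interval between lines
        self.last_emit = now
        last_line = text.rstrip("\n").splitlines()[-1][:200]
        self.log_fn(f"job {job_id} cell [{cell_idx}] out: {last_line}")
```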
One-per-second was too chatty for realistic workloads (training loops,
evals). Ten seconds is a saner default — still plenty of mid-run
notifications but 10x less noise in the log file and in Monitor.
Override via NB_MCP_PROGRESS_INTERVAL_SEC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: a cell with zero stream output (time.sleep, GPU compute,
blocking I/O) produced no log lines between the initial "cell
running" and final "cell done" events. Monitor stayed silent for the
whole run — indistinguishable from a hung kernel from the agent's
perspective.

Now: a per-cell heartbeat thread logs `job X cell [N] still running
(Ns elapsed)` at the progress interval, but only when the cell is
genuinely silent. The heartbeat skips if:
- the progress emitter logged within the interval (chatty cell), or
- the kernel produced output within the interval (output arrived
  but was throttled from being logged)

So a chatty cell's real output always wins; the heartbeat only
surfaces for actually-silent work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
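The skip conditions above can be sketched as a pure decision function — argument names and the exact line format are assumptions; the real code runs this inside the heartbeat thread:

```python
import time


def heartbeat_line(job_id, cell_idx, started, last_progress_log,
                   last_output, interval, now=None):
    """Decide whether to emit a heartbeat line (illustrative sketch).

    Returns the "still running" line only when the cell is genuinely
    silent: no progress line emitted within the interval, and no kernel
    output within the interval (output that arrived but was throttled
    from being logged also suppresses the heartbeat).
    """
    now = time.monotonic() if now is None else now
    if now - last_progress_log < interval:
        return None  # chatty cell: its real output already surfaced
    if now - last_output < interval:
        return None  # output arrived but was throttled from logging
    return (f"job {job_id} cell [{cell_idx}] still running "
            f"({now - started:.0f}s elapsed)")
```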
Per-notebook exec_status is fine when you already know which file you
care about. But when debugging "is anything running?" or "do I have
stale kernels?", the agent had no way to get a global view without
checking each notebook.

New MCP tool: `status()` (no args). Lists every registered kernel
(path, alive, pid) and every active/recent job. Backed by two new
accessors — kernels.list_all and jobs.list_all_active /
list_all_finished — which read the in-memory state of the running
MCP server.

New CLI: `nb status`. Useful when the MCP is down or when debugging
outside Claude Code. Reads .nb_mcp.log to reconstruct job history and
shells out to pgrep for live ipykernel pids. Not as accurate as the
MCP tool (log-based, not in-memory) but always available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
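The log-based reconstruction behind `nb status` could be sketched like this — the log-line format here is a guess, and `jobs_from_log` is a hypothetical name:

```python
import re


def jobs_from_log(lines):
    """Replay .nb_mcp.log and keep the latest lifecycle event per job.

    Illustrative sketch: log-based, so less accurate than the in-memory
    MCP status() tool, but always available even when the MCP is down.
    """
    jobs: dict[str, str] = {}
    for line in lines:
        m = re.search(r"job (\w+) (submitted|running|complete|error)", line)
        if m:
            jobs[m.group(1)] = m.group(2)  # later events overwrite earlier
    return jobs
```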
Journal 15 covers the whole session: the pivot from blocking wait to
Monitor-driven `nb watch`, the marker/log/progress/heartbeat work
that makes tailing worthwhile, and the hardening fixes that fell
out (wall-clock timeout, iopub recovery, cell-id backfill).

Journal 16 frames the open question of reattaching to kernels across MCP
restarts — what's required, what needs deciding, and a sketch of
the shape without committing to a direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Disk-quota exhaustion and other write-path errors can cause cells to
fail without the agent seeing a clear error — propagation is there
but the logging is weak. Captured in journal 16 as a related
hardening pass rather than as a separate entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a training run's kernel gets OOM-killed mid-execution, the
client's iopub reads fail. The existing recovery code tried to
reset channels, which also fails (no kernel to talk to), and
emitted a generic "iopub desync — channel recovery failed" marker.
The job then continued optimistically against the dead kernel,
with subsequent cells producing more opaque failures.

Now:
- kernels.reset_client raises a new KernelDeadError when the
  underlying km.is_alive() is False, or when wait_for_ready times
  out after the rebuild.
- exec_runner.execute_code catches it distinctly. Emits
  log.error "kernel died during cell execution: …" and an
  nbformat error output (ename=NbMcpKernelDied) on the cell, so
  the job is marked ERROR and subsequent cells are skipped.
- Inline cell marker now says "kernel died mid-execution" with
  operator guidance (run status(), expect fresh kernel next exec)
  instead of the misleading "iopub desync".

Agents monitoring the log via `nb watch` now see a clear root
cause instead of a vague channel message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
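The distinct kernel-death path can be sketched as follows — `KernelDeadError` and `NbMcpKernelDied` come from the commit message, but the surrounding functions are simplified stand-ins where `km` mimics a jupyter_client KernelManager:

```python
class KernelDeadError(RuntimeError):
    """Raised when the kernel process is gone (name from the commit)."""


def reset_client(km):
    """Rebuild client channels, but refuse if the kernel is dead (sketch)."""
    if not km.is_alive():
        raise KernelDeadError("kernel process is not alive")
    return km  # real code rebuilds the ZMQ channels here


def run_cell(km, execute):
    """Catch kernel death distinctly during recovery (illustrative sketch).

    A dead kernel yields an nbformat-style error output so the job is
    marked ERROR and subsequent cells are skipped, instead of the job
    continuing optimistically against a kernel that no longer exists.
    """
    try:
        return execute()
    except ValueError:  # iopub read failed; try channel recovery
        try:
            reset_client(km)
        except KernelDeadError:
            return {"output_type": "error", "ename": "NbMcpKernelDied",
                    "evalue": "kernel died mid-execution; run status(), "
                              "expect a fresh kernel on the next exec",
                    "traceback": []}
        raise  # channels recovered but the original error stands
```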
