BioRouter QA: 5 agent improvements + 36-app test suite (Xiaomi MiMo) by Broccolito · Pull Request #5 · BaranziniLab/biorouter

Broccolito · 2026-06-20T08:35:48Z

Consolidates the BioRouter CLI QA effort and the resulting agent improvements.

BioRouter improvements (all in shared backend crates → apply to CLI and GUI)

fix(developer): accept file_path alias for text_editor path (kills -32602 wasted-turn class)
fix(providers): deeper transient-429 retry budget (8 attempts vs 3)
feat(developer,hooks): git context in the developer extension + verify/checkpoint Stop hook
fix(cli): graceful --resume fallback + readable tool-call paths
fix(agent): quantified per-turn action-limit stop

QA suite (biorouter-testing-apps/)

36 apps fully verified (~3,600 passing tests across Rust/Python/C++/R), each built by the
BioRouter CLI agent (Xiaomi MiMo), with 8 issue reports, FINAL_REPORT, FAILURE_LOG, UX_BENCHMARK,
and per-app git-history bundles.

Also includes bundled recent work in this branch's base: autovisualiser hardening + 24
visualizations, the Agent Drafter built-in extension, and the 1.85.4 release bump.

🤖 Generated with Claude Code

…ations Expand the Auto Visualiser from 8 to 32 tools, fix the recurring "visualization cannot be generated" failures, and make the whole extension more robust. Each tool still follows one pipeline; the shared parts now live in a `common` module so adding a figure is ~10 lines + one template. Hardening / robustness (fixes the error users hit on live generation & reopen) - Lenient enum parsing: chart/donut types now accept any case/whitespace ("Line", "LINE", " line ", "Doughnut") via custom Deserialize instead of failing at the rmcp argument layer — the most common live-generation failure. - JSON-safe injection: data is serialized through `js_data`, which neutralizes `</script>` breakout (`<` -> <) and the JS-illegal U+2028/U+2029 separators; free text (titles, mermaid source) is HTML-escaped. The Mermaid template now renders explicitly and surfaces invalid syntax as a friendly error card instead of a blank frame. - Per-tool size limits + semantic validation with actionable messages (unknown sankey/network node, non-square chord matrix, label/data length mismatch, out-of-range lat/lng, lower>upper CI, non-positive log-scale, ...). - Shared client runtime (templates/_common.js): theme-aware palette/colors (light/dark via CSS vars), auto-resize, and a global error boundary that shows a card on any uncaught render error rather than an empty iframe. - Debug HTML dumps are gated behind BIOROUTER_AUTOVIS_DEBUG / debug builds and written to a per-process file in the app cache dir (no more world-writable, race-prone, Windows-nonexistent /tmp paths). Size fix (mitigates large diagrams failing to re-render on chat reopen) - BIOROUTER_AUTOVIS_CDN=1 references libraries from pinned CDN tags instead of inlining them, shrinking the persisted/reloaded blob from megabytes (Mermaid is ~3 MB) to a few KB. Default stays inlined for offline/self-contained use. New tools (reusing the already-vendored D3 / Chart.js / Leaflet / Mermaid) - D3: render_network (force graph), render_heatmap, render_sunburst, render_dendrogram, render_calendar_heatmap, render_boxplot, render_wordcloud, render_kaplan_meier, render_forest - Chart.js: render_histogram, render_bubble, render_area, render_gauge, render_volcano, render_manhattan - Mermaid typed wrappers (compile structured JSON to Mermaid, more reliable than hand-authored syntax): render_flowchart, render_gantt, render_sequence, render_mindmap, render_timeline, render_er_diagram, render_state_diagram, render_class_diagram - Leaflet: render_choropleth (value-shaded GeoJSON regions) Architecture - New tools live in tools_extra/tools_charts/tools_d3/tools_geo, each an additional `#[tool_router(...)]` impl block combined in `new()` via ToolRouter `+`. Hierarchical tools reuse `TreemapNode`; geo reuses `MapCenter`. Tests & verification - 58 unit tests (happy paths + edge cases: empty/mismatched/out-of-range inputs, escaping, lenient parsing) — all green. - Every visualization render-verified in a real browser via Playwright (32/32, no console errors / error cards), plus an end-to-end check in the Electron GUI confirming show_chart renders inline. Docs: tool descriptions with examples, updated server instructions, the configure.rs catalog entry, and a new Auto Visualiser section in CLAUDE.md.

…-tree changes Primary change — a new built-in MCP extension, Agent Drafter, that builds interactive artifacts (static pages, or apps with an embedded BioRouter agent) and exports them as standalone projects: - crates/biorouter-mcp/src/agent_drafter/: the server (mod.rs), HTML/asset rendering (render.rs), persistence (store.rs), and starter templates (starter.html, agent.js, theme.css). - Registered as a built-in in crates/biorouter-mcp/src/lib.rs (pub mod agent_drafter; builtin!(agent_drafter, AgentDrafterServer)). - Surfaced in the CLI `configure` extension list and in ui/desktop/src/built-in-extensions.json (disabled by default). - crates/biorouter-mcp/tests/agent_drafter_registered.rs verifies the extension registers as a built-in. Also captures other changes that were sitting uncommitted in the shared working tree, so nothing is left dangling: - providers/openai.rs: built-in DeepSeek model-id aliases so a saved config keeps working after deepseek-chat/-reasoner are retired (same intent as the merged DeepSeek future-proofing PR). - session/tui/{app,mod}.rs: the interactive CLI (ratatui) UX overhaul — soft-wrapping/auto-growing input box, bottom-pinned input bar, live token streaming, shaded user turns, slash-palette Enter-accept, and box-drawing Markdown tables (same content as the merged CLI TUI PR). - knowledge/soul.rs and system.rs: rustfmt-only line wrapping. - cli/commands/configure.rs: list agent_drafter among configurable extensions. Verified with `cargo check --workspace --all-targets` (clean).

…ng, drop experimental note Three fixes found while testing every tool live in the desktop app driven by mimo-v2.5-pro (verified via the LLM request logs): 1. Stringified `data` arguments (the real "tools don't work" bug). Some models (mimo-v2.5-pro) serialize the nested tool argument as a JSON *string*: {"data": "{...}"} instead of {"data": {...}}. The typed `data` field could not deserialize a string, so EVERY autovisualiser tool was declined at the rmcp layer ("interpreted as a string rather than a structured object"). gpt-style models send an object and worked; mimo did not. Added `common::de_flexible`, applied via `#[serde(deserialize_with)]` to all 32 tools' `data` fields, so each accepts a JSON object OR a stringified-JSON string. Added a regression test using mimo's exact input shape. 2. render_map dimensions. The map template never reported its height to the auto-resizing MCP-UI iframe, so the iframe size didn't match the 600px map (the "weird dimensions"). It now calls BioRouterViz.autoResize(), map.invalidateSize(), and reportSize() once Leaflet knows its real size. 3. Removed the "MCP UI is experimental and may change at any time." note shown beneath every inline visualization (ToolCallWithResponse.tsx). Verified in-app with mimo: all 32 tools render inline (explicit calls), and 31 organic prompts that never name a tool select the correct tool. 59 unit tests pass (incl. stringified-data and object-data cases).

Xiaomi MiMo (and other models using the common 'file_path' convention) intermittently emit the text_editor parameter as 'file_path' instead of 'path'. Because the field was required, serde rejected the call before the handler with an opaque '-32602: missing field path', costing the agent a recovery turn. A serde alias makes the tool accept either key. Found while QA-testing the CLI by building 100 apps with MiMo.

…rors 429s are always transient, but DEFAULT_MAX_RETRIES=3 (1s->2s->4s, ~7s total) is exhausted by sustained throttling (e.g. concurrent sessions on one key), after which the agent loop surfaces a turn-ending error and a build is lost. Give only RateLimitExceeded a dedicated deeper budget (RATE_LIMIT_MAX_RETRIES=8, ~2min with the existing 30s cap) via effective_max_retries(), applied in both retry_operation and with_retry. Generic errors keep the conservative 3. Found while QA-testing the CLI by building 100 apps with MiMo (rate limits truncated builds). Includes unit tests.

…y/checkpoint Stop hook Two complementary version-control improvements, motivated by QA where the agent routinely left work uncommitted, used non-reproducible layouts, or declared a C++ build done without ever compiling. Plan A (light touch, always on): when the working dir is a git repo, the developer extension now injects a git stanza into its instructions — current branch, uncommitted-change count, and a concise policy (commit logical units with clear messages; .gitignore build artifacts; never rewrite history or run destructive git ops without an explicit request). Emits nothing outside a repo. Plan B (opt-in enforcement): scripts/hooks/verify-and-checkpoint.sh, a Stop hook that blocks finishing until (1) the tree is committed (reproducible from a clean checkout) and (2) with BIOROUTER_VERIFY_BUILD=1, the project builds and tests pass for its toolchain (cargo/cmake/pytest/npm) — including a fallback that runs *test* binaries when a CMake project forgets add_test(). Failure-open; bounded by the runtime's Stop-hook block cap. Docs in docs/hooks/verify-and-checkpoint.md. Both live in shared backend code, so they apply to the CLI and the GUI.

F4: 'biorouter run --resume --name X' previously errored ('No session found with name X', rc=1) when the session didn't exist — a dead end for a typo'd name or a session started with --no-session. Now it warns and starts a fresh session with that name. The no-identifier --resume case likewise falls back to a new session instead of erroring when there's nothing to resume. C1: tool-call path headers over-abbreviated every directory to a single letter (path: ~/D/b/a/s/algorithms/bfs.rs), making it hard to tell which file was being edited. shorten_path now collapses only the middle to a single ellipsis and keeps the in-project tail in full (~/.../project/src/algorithms/bfs.rs). Test updated.

C2: when the agent hit its action budget it emitted a generic 'reached the maximum number of actions' message, indistinguishable from a normal completion and giving no number. It now states the limit (max_turns), clarifies it stopped on the cap rather than because the task is necessarily done, points at the max_turns / BIOROUTER_MAX_TURNS knob, and logs per-action progress (N/max) so an observer can tell budget-exhaustion from a real finish.

Moves the BioRouter CLI QA workspace (previously a sibling on the Desktop) into the project as biorouter-testing-apps/, per request. Contents: 12 real multi-file apps built by the BioRouter CLI (Xiaomi MiMo) across Rust/Python/C++ (~1,149 passing tests) — pathfinding, sorting-visualizer, BST family, graph toolkit, string matching, dynamic programming, hash tables, LZ77+Huffman compression, bignum, bloom/cuckoo filters, FASTA/FASTQ toolkit, sequence alignment — plus the QA artifacts (CHECKLIST, PROGRESS, FAILURE_LOG, UX_BENCHMARK, FINAL_REPORT, IMPROVEMENTS, ISSUES/round-1 & 2, specs/, and the build_app.sh / interact.sh harness). Notes: - The 13 nested git repos (12 apps + QA root) were flattened so their content is tracked here; each repo's full per-app history is preserved as a recoverable git bundle under _history-bundles/ (restore with: git clone <name>.bundle). - Regenerable build artifacts (target/, build/, .venv, __pycache__, *.log) and the user's separate autovis-phase3/ dataset are gitignored and NOT committed.

…env-overridable)

Apps 13-15 (Python phylogenetics 156 tests, Python variant-caller 124 tests, C++ kmer-counter 82/82) imported as flat files; per-app histories bundled to _history-bundles/. App 15 is the first C++ app to build+test clean on the first try (7 commits) — early evidence the round-3 git-context/reproducibility improvements are working. Adds ISSUES/round-3-report.md + tracker updates.

Apps 16-20 imported as flat files (R gene-expression 67 tests, protein-structure [partial, no tests], blast-lite 60, genome-assembly 70, motif-finder 94/97); per-app histories bundled to _history-bundles/. Adds ISSUES/round-4-report.md. Bioinformatics batch (apps 11-20) complete: R toolchain validated; loop resilient to keychain + deleted-binary disruptions. ~20 apps, ~1930 passing tests across Rust/Python/C++/R.

Apps 21-25 (FHIR parser 253 tests, R survival 78, ICD/SNOMED mapper [partial, no tests], clinical-trial-sim 126/128, Rust DDI graph 115) imported as flat files; histories bundled. Adds ISSUES/round-5-report.md. ~25 apps, ~2500 tests across Rust/Python/C++/R. Lead finding: premature stream stops (3x) at code->tests transitions — round-5 improvement (continue-on-truncation) follows.

… mitigation Apps 26-30 (risk scores 200 tests, SQL cohort builder 60, R biomarker 65, SEIR 82, DICOM 124) imported as flat files; histories bundled. Biomedical batch (21-30) complete: 31 apps, ~3220 tests across Rust/Python/C++/R. Adds round-6 report + the round-5 IMPROVEMENTS doc. build_app.sh now prompts incremental tests — a zero-risk mitigation for the premature-stop pattern that appears effective (apps 27-30 had no premature stops). R now 3/3 clean one-shots.

…t transport, TUI version, testing-apps Consolidates the shared working tree across several concurrent work streams so nothing is left uncommitted. Verified with `cargo check --workspace --all-targets` (clean). Version bump → 1.85.4 - Cargo workspace version plus the four desktop JSON mirrors (package.json, package-lock.json ×2, openapi.json); Cargo.lock refreshed. - CLI TUI greeting now prints the running version (env!("CARGO_PKG_VERSION")), so the interactive session shows v1.85.4. Agent Drafter extension — UI + behavior expansion - crates/biorouter-mcp/src/agent_drafter/: server (mod.rs), render, and store updates, with reworked starter templates (agent.js, starter.html, and a substantially expanded theme.css). - ui/desktop: MCPUIResourceRenderer renders embedded agent UI resources; main.ts / preload.ts wiring; a bundled-extensions.json entry; plus scripts/build-main-dev.mjs and scripts/agent-drafter/ helper scripts. ACP (Agent Communication Protocol) — WebSocket transport - crates/biorouter-acp/src/server.rs: serve ACP over a WebSocket in addition to stdio (DEFAULT_WS_ADDR 127.0.0.1:11577), with the dependency additions in Cargo.toml and a new tests/ws_transport_test.rs. - crates/biorouter-cli/src/cli.rs: `acp --ws [ADDR]` flag to start the WebSocket transport (e.g. for agent-enabled artifacts). Testing harness (biorouter-testing-apps/) - Seven new standalone statistics test apps (Python / R / C++), their specs and git history bundles, and a round-7 issue report; FAILURE_LOG and PROGRESS updates.

… request) Apps 31-35 fully verified (Bayesian MCMC 108, R GLM install-clean, ARIMA 70, R hypothesis-testing 111, bootstrap 90); apps 36-37 (PCA C++, R survival power) partial — stopped mid-build when the loop was paused. Per-app histories bundled to _history-bundles/. Adds ISSUES/round-7-report.md + FAILURE_LOG/IMPROVEMENTS/ PROGRESS updates + the incremental-test harness mitigation. 36 apps fully verified, ~3600 passing tests across Rust/Python/C++/R.

Broccolito added 16 commits June 19, 2026 09:52

qa: repoint build harness ROOT to in-project biorouter-testing-apps (…

46a006d

…env-overridable)

Broccolito merged commit 35d8ce8 into main Jun 20, 2026
2 checks passed

Broccolito deleted the improve/git-and-report-followups branch June 20, 2026 08:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BioRouter QA: 5 agent improvements + 36-app test suite (Xiaomi MiMo)#5

BioRouter QA: 5 agent improvements + 36-app test suite (Xiaomi MiMo)#5
Broccolito merged 16 commits into
mainfrom
improve/git-and-report-followups

Broccolito commented Jun 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Broccolito commented Jun 20, 2026

BioRouter improvements (all in shared backend crates → apply to CLI and GUI)

QA suite (biorouter-testing-apps/)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant