BioRouter QA: 5 agent improvements + 36-app test suite (Xiaomi MiMo)#5
Merged
Conversation
…ations
Expand the Auto Visualiser from 8 to 32 tools, fix the recurring
"visualization cannot be generated" failures, and make the whole extension
more robust. Each tool still follows one pipeline; the shared parts now live
in a `common` module so adding a figure is ~10 lines + one template.
Hardening / robustness (fixes the error users hit on live generation & reopen)
- Lenient enum parsing: chart/donut types now accept any case/whitespace
("Line", "LINE", " line ", "Doughnut") via custom Deserialize instead of
failing at the rmcp argument layer — the most common live-generation failure.
- JSON-safe injection: data is serialized through `js_data`, which neutralizes
`</script>` breakout (`<` -> <) and the JS-illegal U+2028/U+2029
separators; free text (titles, mermaid source) is HTML-escaped. The Mermaid
template now renders explicitly and surfaces invalid syntax as a friendly
error card instead of a blank frame.
- Per-tool size limits + semantic validation with actionable messages
(unknown sankey/network node, non-square chord matrix, label/data length
mismatch, out-of-range lat/lng, lower>upper CI, non-positive log-scale, ...).
- Shared client runtime (templates/_common.js): theme-aware palette/colors
(light/dark via CSS vars), auto-resize, and a global error boundary that
shows a card on any uncaught render error rather than an empty iframe.
- Debug HTML dumps are gated behind BIOROUTER_AUTOVIS_DEBUG / debug builds and
written to a per-process file in the app cache dir (no more world-writable,
race-prone, Windows-nonexistent /tmp paths).
Size fix (mitigates large diagrams failing to re-render on chat reopen)
- BIOROUTER_AUTOVIS_CDN=1 references libraries from pinned CDN tags instead of
inlining them, shrinking the persisted/reloaded blob from megabytes (Mermaid
is ~3 MB) to a few KB. Default stays inlined for offline/self-contained use.
New tools (reusing the already-vendored D3 / Chart.js / Leaflet / Mermaid)
- D3: render_network (force graph), render_heatmap, render_sunburst,
render_dendrogram, render_calendar_heatmap, render_boxplot, render_wordcloud,
render_kaplan_meier, render_forest
- Chart.js: render_histogram, render_bubble, render_area, render_gauge,
render_volcano, render_manhattan
- Mermaid typed wrappers (compile structured JSON to Mermaid, more reliable
than hand-authored syntax): render_flowchart, render_gantt, render_sequence,
render_mindmap, render_timeline, render_er_diagram, render_state_diagram,
render_class_diagram
- Leaflet: render_choropleth (value-shaded GeoJSON regions)
Architecture
- New tools live in tools_extra/tools_charts/tools_d3/tools_geo, each an
additional `#[tool_router(...)]` impl block combined in `new()` via
ToolRouter `+`. Hierarchical tools reuse `TreemapNode`; geo reuses `MapCenter`.
Tests & verification
- 58 unit tests (happy paths + edge cases: empty/mismatched/out-of-range
inputs, escaping, lenient parsing) — all green.
- Every visualization render-verified in a real browser via Playwright
(32/32, no console errors / error cards), plus an end-to-end check in the
Electron GUI confirming show_chart renders inline.
Docs: tool descriptions with examples, updated server instructions, the
configure.rs catalog entry, and a new Auto Visualiser section in CLAUDE.md.
…-tree changes
Primary change — a new built-in MCP extension, Agent Drafter, that builds
interactive artifacts (static pages, or apps with an embedded BioRouter agent)
and exports them as standalone projects:
- crates/biorouter-mcp/src/agent_drafter/: the server (mod.rs), HTML/asset
rendering (render.rs), persistence (store.rs), and starter templates
(starter.html, agent.js, theme.css).
- Registered as a built-in in crates/biorouter-mcp/src/lib.rs
(pub mod agent_drafter; builtin!(agent_drafter, AgentDrafterServer)).
- Surfaced in the CLI `configure` extension list and in
ui/desktop/src/built-in-extensions.json (disabled by default).
- crates/biorouter-mcp/tests/agent_drafter_registered.rs verifies the
extension registers as a built-in.
Also captures other changes that were sitting uncommitted in the shared
working tree, so nothing is left dangling:
- providers/openai.rs: built-in DeepSeek model-id aliases so a saved config
keeps working after deepseek-chat/-reasoner are retired (same intent as the
merged DeepSeek future-proofing PR).
- session/tui/{app,mod}.rs: the interactive CLI (ratatui) UX overhaul —
soft-wrapping/auto-growing input box, bottom-pinned input bar, live token
streaming, shaded user turns, slash-palette Enter-accept, and box-drawing
Markdown tables (same content as the merged CLI TUI PR).
- knowledge/soul.rs and system.rs: rustfmt-only line wrapping.
- cli/commands/configure.rs: list agent_drafter among configurable extensions.
Verified with `cargo check --workspace --all-targets` (clean).
…ng, drop experimental note
Three fixes found while testing every tool live in the desktop app driven by
mimo-v2.5-pro (verified via the LLM request logs):
1. Stringified `data` arguments (the real "tools don't work" bug). Some models
(mimo-v2.5-pro) serialize the nested tool argument as a JSON *string*:
{"data": "{...}"} instead of {"data": {...}}. The typed `data` field could
not deserialize a string, so EVERY autovisualiser tool was declined at the
rmcp layer ("interpreted as a string rather than a structured object").
gpt-style models send an object and worked; mimo did not. Added
`common::de_flexible`, applied via `#[serde(deserialize_with)]` to all 32
tools' `data` fields, so each accepts a JSON object OR a stringified-JSON
string. Added a regression test using mimo's exact input shape.
2. render_map dimensions. The map template never reported its height to the
auto-resizing MCP-UI iframe, so the iframe size didn't match the 600px map
(the "weird dimensions"). It now calls BioRouterViz.autoResize(),
map.invalidateSize(), and reportSize() once Leaflet knows its real size.
3. Removed the "MCP UI is experimental and may change at any time." note shown
beneath every inline visualization (ToolCallWithResponse.tsx).
Verified in-app with mimo: all 32 tools render inline (explicit calls), and 31
organic prompts that never name a tool select the correct tool. 59 unit tests
pass (incl. stringified-data and object-data cases).
Xiaomi MiMo (and other models using the common 'file_path' convention) intermittently emit the text_editor parameter as 'file_path' instead of 'path'. Because the field was required, serde rejected the call before the handler with an opaque '-32602: missing field path', costing the agent a recovery turn. A serde alias makes the tool accept either key. Found while QA-testing the CLI by building 100 apps with MiMo.
…rors 429s are always transient, but DEFAULT_MAX_RETRIES=3 (1s->2s->4s, ~7s total) is exhausted by sustained throttling (e.g. concurrent sessions on one key), after which the agent loop surfaces a turn-ending error and a build is lost. Give only RateLimitExceeded a dedicated deeper budget (RATE_LIMIT_MAX_RETRIES=8, ~2min with the existing 30s cap) via effective_max_retries(), applied in both retry_operation and with_retry. Generic errors keep the conservative 3. Found while QA-testing the CLI by building 100 apps with MiMo (rate limits truncated builds). Includes unit tests.
…y/checkpoint Stop hook Two complementary version-control improvements, motivated by QA where the agent routinely left work uncommitted, used non-reproducible layouts, or declared a C++ build done without ever compiling. Plan A (light touch, always on): when the working dir is a git repo, the developer extension now injects a git stanza into its instructions — current branch, uncommitted-change count, and a concise policy (commit logical units with clear messages; .gitignore build artifacts; never rewrite history or run destructive git ops without an explicit request). Emits nothing outside a repo. Plan B (opt-in enforcement): scripts/hooks/verify-and-checkpoint.sh, a Stop hook that blocks finishing until (1) the tree is committed (reproducible from a clean checkout) and (2) with BIOROUTER_VERIFY_BUILD=1, the project builds and tests pass for its toolchain (cargo/cmake/pytest/npm) — including a fallback that runs *test* binaries when a CMake project forgets add_test(). Failure-open; bounded by the runtime's Stop-hook block cap. Docs in docs/hooks/verify-and-checkpoint.md. Both live in shared backend code, so they apply to the CLI and the GUI.
F4: 'biorouter run --resume --name X' previously errored ('No session found with
name X', rc=1) when the session didn't exist — a dead end for a typo'd name or a
session started with --no-session. Now it warns and starts a fresh session with
that name. The no-identifier --resume case likewise falls back to a new session
instead of erroring when there's nothing to resume.
C1: tool-call path headers over-abbreviated every directory to a single letter
(path: ~/D/b/a/s/algorithms/bfs.rs), making it hard to tell which file was being
edited. shorten_path now collapses only the middle to a single ellipsis and keeps
the in-project tail in full (~/.../project/src/algorithms/bfs.rs). Test updated.
C2: when the agent hit its action budget it emitted a generic 'reached the maximum number of actions' message, indistinguishable from a normal completion and giving no number. It now states the limit (max_turns), clarifies it stopped on the cap rather than because the task is necessarily done, points at the max_turns / BIOROUTER_MAX_TURNS knob, and logs per-action progress (N/max) so an observer can tell budget-exhaustion from a real finish.
Moves the BioRouter CLI QA workspace (previously a sibling on the Desktop) into the project as biorouter-testing-apps/, per request. Contents: 12 real multi-file apps built by the BioRouter CLI (Xiaomi MiMo) across Rust/Python/C++ (~1,149 passing tests) — pathfinding, sorting-visualizer, BST family, graph toolkit, string matching, dynamic programming, hash tables, LZ77+Huffman compression, bignum, bloom/cuckoo filters, FASTA/FASTQ toolkit, sequence alignment — plus the QA artifacts (CHECKLIST, PROGRESS, FAILURE_LOG, UX_BENCHMARK, FINAL_REPORT, IMPROVEMENTS, ISSUES/round-1 & 2, specs/, and the build_app.sh / interact.sh harness). Notes: - The 13 nested git repos (12 apps + QA root) were flattened so their content is tracked here; each repo's full per-app history is preserved as a recoverable git bundle under _history-bundles/ (restore with: git clone <name>.bundle). - Regenerable build artifacts (target/, build/, .venv, __pycache__, *.log) and the user's separate autovis-phase3/ dataset are gitignored and NOT committed.
Apps 13-15 (Python phylogenetics 156 tests, Python variant-caller 124 tests, C++ kmer-counter 82/82) imported as flat files; per-app histories bundled to _history-bundles/. App 15 is the first C++ app to build+test clean on the first try (7 commits) — early evidence the round-3 git-context/reproducibility improvements are working. Adds ISSUES/round-3-report.md + tracker updates.
Apps 16-20 imported as flat files (R gene-expression 67 tests, protein-structure [partial, no tests], blast-lite 60, genome-assembly 70, motif-finder 94/97); per-app histories bundled to _history-bundles/. Adds ISSUES/round-4-report.md. Bioinformatics batch (apps 11-20) complete: R toolchain validated; loop resilient to keychain + deleted-binary disruptions. ~20 apps, ~1930 passing tests across Rust/Python/C++/R.
Apps 21-25 (FHIR parser 253 tests, R survival 78, ICD/SNOMED mapper [partial, no tests], clinical-trial-sim 126/128, Rust DDI graph 115) imported as flat files; histories bundled. Adds ISSUES/round-5-report.md. ~25 apps, ~2500 tests across Rust/Python/C++/R. Lead finding: premature stream stops (3x) at code->tests transitions — round-5 improvement (continue-on-truncation) follows.
… mitigation Apps 26-30 (risk scores 200 tests, SQL cohort builder 60, R biomarker 65, SEIR 82, DICOM 124) imported as flat files; histories bundled. Biomedical batch (21-30) complete: 31 apps, ~3220 tests across Rust/Python/C++/R. Adds round-6 report + the round-5 IMPROVEMENTS doc. build_app.sh now prompts incremental tests — a zero-risk mitigation for the premature-stop pattern that appears effective (apps 27-30 had no premature stops). R now 3/3 clean one-shots.
…t transport, TUI version, testing-apps
Consolidates the shared working tree across several concurrent work streams
so nothing is left uncommitted. Verified with
`cargo check --workspace --all-targets` (clean).
Version bump → 1.85.4
- Cargo workspace version plus the four desktop JSON mirrors
(package.json, package-lock.json ×2, openapi.json); Cargo.lock refreshed.
- CLI TUI greeting now prints the running version
(env!("CARGO_PKG_VERSION")), so the interactive session shows v1.85.4.
Agent Drafter extension — UI + behavior expansion
- crates/biorouter-mcp/src/agent_drafter/: server (mod.rs), render, and
store updates, with reworked starter templates (agent.js, starter.html,
and a substantially expanded theme.css).
- ui/desktop: MCPUIResourceRenderer renders embedded agent UI resources;
main.ts / preload.ts wiring; a bundled-extensions.json entry; plus
scripts/build-main-dev.mjs and scripts/agent-drafter/ helper scripts.
ACP (Agent Communication Protocol) — WebSocket transport
- crates/biorouter-acp/src/server.rs: serve ACP over a WebSocket in
addition to stdio (DEFAULT_WS_ADDR 127.0.0.1:11577), with the dependency
additions in Cargo.toml and a new tests/ws_transport_test.rs.
- crates/biorouter-cli/src/cli.rs: `acp --ws [ADDR]` flag to start the
WebSocket transport (e.g. for agent-enabled artifacts).
Testing harness (biorouter-testing-apps/)
- Seven new standalone statistics test apps (Python / R / C++), their
specs and git history bundles, and a round-7 issue report; FAILURE_LOG
and PROGRESS updates.
… request) Apps 31-35 fully verified (Bayesian MCMC 108, R GLM install-clean, ARIMA 70, R hypothesis-testing 111, bootstrap 90); apps 36-37 (PCA C++, R survival power) partial — stopped mid-build when the loop was paused. Per-app histories bundled to _history-bundles/. Adds ISSUES/round-7-report.md + FAILURE_LOG/IMPROVEMENTS/ PROGRESS updates + the incremental-test harness mitigation. 36 apps fully verified, ~3600 passing tests across Rust/Python/C++/R.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Consolidates the BioRouter CLI QA effort and the resulting agent improvements.
BioRouter improvements (all in shared backend crates → apply to CLI and GUI)
file_pathalias for text_editorpath(kills-32602wasted-turn class)--resumefallback + readable tool-call pathsQA suite (biorouter-testing-apps/)
36 apps fully verified (~3,600 passing tests across Rust/Python/C++/R), each built by the
BioRouter CLI agent (Xiaomi MiMo), with 8 issue reports, FINAL_REPORT, FAILURE_LOG, UX_BENCHMARK,
and per-app git-history bundles.
Also includes bundled recent work in this branch's base: autovisualiser hardening + 24
visualizations, the Agent Drafter built-in extension, and the 1.85.4 release bump.
🤖 Generated with Claude Code