Reliability research by vilaca · Pull Request #7 · vilaca/factory

vilaca · 2026-05-22T18:57:36Z

No description provided.

Deep-dive notes on a reliability stack for self-hosted LLM tool calling — guardrails, compaction, synthetic respond tool, ablation, and eval significance — captured as next-steps.md for replication on our stack.

Empirical results from the IEEE preprint — compounding-error math, 8B+framework parity with frontier, backend hidden-variable swings, compaction strategy table, and replication takeaways. Companion to next-steps.md.
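The compounding-error argument can be illustrated with the standard independence model. This is a sketch under the assumption that every tool-call step succeeds independently with the same probability $p$; the preprint's exact formulation is not reproduced in this PR, and the symbols $p$ and $n$ are introduced here for illustration only.

```latex
% Assumed model: n sequential steps, each succeeding independently with probability p.
P(\text{workflow completes}) = p^{\,n}
% Worked example: a step that is 95% reliable, chained over 10 steps.
0.95^{10} \approx 0.599
```

Under this assumption, a per-step success rate that looks high still leaves roughly 40% of ten-step workflows failing, which is the motivation for in-loop guardrails.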

…l framework, and BFCL integration

Three additions from a fuller read of the ADR directory: the _resolve_service lesson on why strict backends need a soft-error type, the five-tier diagnostic eval framework (lambda → ablated lambda → stateful → ablated stateful → stateful+strict), the information_loss future scenario concept, and the BFCL integration architecture as external validation.

…on caveat, and related-work section

From the author's HN-thread discussion: per-guardrail impact ranking (retry nudges 24-49pt, error recovery ~10pt, rescue + compaction no eval signal but kept for production), 97-config coverage figure, and comparison to Instructor / LangChain / DSPy / Outlines / tool_choice=any explaining why they stack rather than substitute.

…and JSONL row schema

Three operational additions:

- --reasoning-budget 0 workaround for post-2026-04-10 llama.cpp builds on reasoning models (silent hangs).
- Production observability wiring patterns for on_message / on_compact / on_chunk callbacks, with logger + Prometheus + alerting examples.
- The JSONL row schema for resumable batch eval, including the per-run identity/outcome/mechanics/history fields plus the rig provenance field for multi-rig datasets.

…rdrails)

Implements the in-loop reliability guardrails from the IEEE preprint / next-steps specs so cheaper, smaller models reach near-frontier completion rates on multi-step tool workflows. Auto-enabled for weak-tier models and any model in the sampling-defaults table; frontier models pay near-zero cost.

What's new (each gated by the auto-enable rule in reliability-config.ts):

- Synthetic Respond tool: gives small models a structured terminal action so the loop never has to disambiguate text vs. tool call. Auto-injected on the wire only for weak-tier / known small-model deployments; single-Respond batches short-circuit to text-done.
- Message tagging: every conversation append carries a MessageType (system_prompt / user_input / tool_call / tool_result / reasoning / text_response / step_nudge / prerequisite_nudge / retry_nudge / context_warning / summary). Stripped at the wire boundary.
- Tiered 3-phase compaction: deterministic text manipulation, no LLM call. P1 drops nudges + truncates tool_results to 200 chars; P2 also drops tool_results; P3 drops reasoning + text_response. keep_recent=2 iteration boundaries. LLM-summary path stays as a Phase-4 emergency fallback.
- ResponseValidator + Nudge: stateless validator emits structured retry / unknown-tool / step / prerequisite nudges with three-tier escalation for premature-terminal attempts.
- StepEnforcer + tool prerequisites: opt-in via requiredSteps / terminalTools on AgentOptions; declared ToolDefinition.prerequisites validated at registry-build time. StepEnforcementError / PrerequisiteError on exhaustion.
- ToolResolutionError: new tool-author exception type ("valid request, no data") with softError tagging; the hard-error counter only bumps on non-resolution throws, letting models fumble 8+ wrong-key lookups within the iteration budget while still bailing on real bugs.
- Context threshold warnings: once-per-session-per-threshold transient "context filling up" / "nearly full" injection at 65% / 80% with re-arm on drop.
- Reasoning fold + think-tag utilities ([THINK] / <think>) for future wire serialization.
- Per-model sampling defaults map (Ministral / Qwen3 / Granite 4 / Gemma 4) with strict / non-strict policy.
- Per-call sampling overrides on every provider (Ollama, Anthropic, all OpenAI-compat via shared adapter). New ChatOptions fields: topP, topK, minP, repeatPenalty, presencePenalty, recommendedSampling.
- Anthropic synthetic tool_result for unpaired tool_use (load-bearing for the step/prereq nudge path) + tool_choice="any" exposed via ChatOptions.forceToolCall (auto-enabled on weak-tier Anthropic models).
- Backend context discovery via API: Ollama /api/show, llama-server /props.
- Observability events: step-nudge, prerequisite-nudge, step-completed, context-warning, respond-stripped; compaction event now carries phase: 0|1|2|3|4.

Out of scope (API-only constraint): ServerManager / launch flags / nvidia-smi / multi-slot / proxy server / ablation framework / eval harness.

Test coverage: 1260 unit tests (+80) across 11 new test files. Full plan in ~/.claude/plans/recursive-stargazing-truffle.md.
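The tiered compaction rules above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `TaggedMessage` shape, the `iteration` field, and the `compactTiered` name are assumptions, and the sketch reads keep_recent=2 as "messages from the two most recent iterations are left untouched". It also ignores tool_call ↔ tool_result pairing, which real wire formats may require.

```typescript
// Message types as listed in the commit message above.
type MessageType =
  | "system_prompt" | "user_input" | "tool_call" | "tool_result"
  | "reasoning" | "text_response"
  | "step_nudge" | "prerequisite_nudge" | "retry_nudge"
  | "context_warning" | "summary";

// Hypothetical message shape; `iteration` is an assumed field marking
// which agent-loop iteration produced the message.
interface TaggedMessage {
  type: MessageType;
  content: string;
  iteration: number;
}

const NUDGE_TYPES: ReadonlySet<MessageType> = new Set<MessageType>([
  "step_nudge",
  "prerequisite_nudge",
  "retry_nudge",
]);

const TOOL_RESULT_CAP = 200; // characters, per the P1 rule

// Deterministic, LLM-free compaction. Each phase includes the previous one.
function compactTiered(
  messages: TaggedMessage[],
  phase: 1 | 2 | 3,
  keepRecent = 2,
): TaggedMessage[] {
  const latest = messages.reduce((max, m) => Math.max(max, m.iteration), 0);
  const protectedFrom = latest - keepRecent + 1;
  const out: TaggedMessage[] = [];
  for (const msg of messages) {
    // Recent iterations are kept verbatim.
    if (msg.iteration >= protectedFrom) {
      out.push(msg);
      continue;
    }
    // P1+: drop nudges.
    if (NUDGE_TYPES.has(msg.type)) continue;
    if (msg.type === "tool_result") {
      // P2+: drop old tool results entirely.
      if (phase >= 2) continue;
      // P1: truncate old tool results.
      out.push({ ...msg, content: msg.content.slice(0, TOOL_RESULT_CAP) });
      continue;
    }
    // P3: drop old reasoning and free-text responses.
    if (phase >= 3 && (msg.type === "reasoning" || msg.type === "text_response")) {
      continue;
    }
    out.push(msg);
  }
  return out;
}
```

A caller would typically try phase 1 first and escalate only if the result still exceeds the context budget; because every phase is pure text manipulation, the same input always yields the same output.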

CI lint flagged five rules after the reliability-stack commit. All behavior-preserving extractions:

- run-agent.ts: split the runAgent generator under the 300-line cap by extracting fireUserPromptSubmit, handleNoToolCallsBranch, detectRespondShortCircuit, emitEnforcerObservability, resolveChainPointer, captureChainPointer, maybeInjectContextWarning, settleCleanBatch, isHardErrorBudgetExhausted, and hasAnyPrereqs. Replaced an inline import() type annotation with the proper type-import (consistent-type-imports rule).
- shared.ts: pulled resolveSampling's repetitive conditional ladders into mergeDefaults / mergeOverrides helpers backed by a single ChatOptions→ResolvedSampling field map; complexity drops well below the cap.
- anthropic.ts: extracted anthropicExtras(sampling, forceToolCall, hasTools) so both the streaming and non-streaming params builders stay under the per-method complexity cap.
- ollama.ts: extracted ollamaOptions(model, maxTokens, sampling) used by both chat() and chatNoStream(); same intent.

Tests still green (1260 unit + 48 e2e, 4 PTY skipped). No public API changes.

Applies fixes from the branch review at ~/.claude/plans/humming-bubbling-lollipop.md.

- providers: thread `providerName` through buildChatBody so the per-model sampling-defaults diagnostic INFO line fires for OpenAI-compat providers too (openai, cerebras, llamacpp, groq, openrouter, vercel, mistral, workersai, copilot, googleaistudio, opencodezen). Defaults were already applied silently; only the log was missing.
- docs: move research notes off the repo root.
    next-steps.md → docs/reliability/next-steps.md
    next_steps_paper.md → docs/reliability/paper-findings.md
  Updates 19 in-source `next-steps.md §N` references to the new path.
- tiered-compact: `changed` is now cumulative across P1/P2/P3 via a running OR. Previously it returned only the final phase's delta, so a P1-only mutation followed by a no-op P2 falsely reported `changed=false` to callers (logging-only impact, but misleading).
- anthropic: flush `pendingToolUseIds` on *any* plain-message boundary in `splitMessagesForAnthropic`, not just user-role. The conversation grammar never produces consecutive assistant runs today, but the guard hardens against future regressions (summary injection, mid-turn rewrites) that could leak unpaired tool_use to the wire; Anthropic 400s on that shape.
- tools/types: codify the `ToolResult` flag matrix as a doc comment. Enumerates the 7 reachable combinations and the mutual-exclusion rules (softError ⊕ hardError, both imply success=false, etc.).
- tests (+23): three new suites/sections.
  * resolveSampling three-tier merge chain (8 cases): instance defaults → per-model table → per-call overrides precedence, camel→snake field mapping, recommendedSampling on unknown model is a no-op.
  * autoEnableForModel tier gating (5 cases): weak/medium/strong paths, sampling-profile override of tier, forceToolCall only fires for weak-tier Anthropic.
  * tiered-compact (3 cases): cumulative `changed` after multi-phase run, tool_call ↔ tool_result pairing across P2, metadata preservation through all phases.

No public API changes (buildChatBody gains an optional field). 1283 unit tests pass (+23). No production behavior change beyond the diagnostic-log and Anthropic flush hardening above.
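The three-tier merge chain exercised by the new resolveSampling tests could look something like the sketch below. It assumes the arrows denote ascending precedence (per-call overrides win). The `SamplingOptions` shape, the `resolveSamplingSketch` name, the `useRecommended` flag, and the table entry are all hypothetical; the numeric values are placeholders and not the PR's per-model defaults.

```typescript
// Camel-case sampling knobs, mirroring the ChatOptions fields named in the PR.
interface SamplingOptions {
  temperature?: number;
  topP?: number;
  topK?: number;
  minP?: number;
  repeatPenalty?: number;
  presencePenalty?: number;
}

// Single camel -> snake field map used when building the wire payload.
const FIELD_MAP: Record<keyof SamplingOptions, string> = {
  temperature: "temperature",
  topP: "top_p",
  topK: "top_k",
  minP: "min_p",
  repeatPenalty: "repeat_penalty",
  presencePenalty: "presence_penalty",
};

// Placeholder per-model table; model name and values are illustrative only.
const MODEL_DEFAULTS: Record<string, SamplingOptions | undefined> = {
  "example-small-model": { temperature: 0.7, topP: 0.8 },
};

// Later layers win, but only for fields they actually define.
function mergeDefined(...layers: SamplingOptions[]): SamplingOptions {
  const out: SamplingOptions = {};
  for (const layer of layers) {
    for (const key of Object.keys(FIELD_MAP) as (keyof SamplingOptions)[]) {
      const value = layer[key];
      if (value !== undefined) out[key] = value;
    }
  }
  return out;
}

function resolveSamplingSketch(
  model: string,
  instanceDefaults: SamplingOptions,
  perCall: SamplingOptions,
  useRecommended: boolean,
): Record<string, number> {
  // Unknown models contribute an empty layer, so the flag is a no-op for them.
  const modelLayer = useRecommended ? MODEL_DEFAULTS[model] ?? {} : {};
  const merged = mergeDefined(instanceDefaults, modelLayer, perCall);
  const wire: Record<string, number> = {};
  for (const key of Object.keys(FIELD_MAP) as (keyof SamplingOptions)[]) {
    const value = merged[key];
    if (value !== undefined) wire[FIELD_MAP[key]] = value;
  }
  return wire;
}
```

Skipping undefined fields while merging keeps an explicit `undefined` in a per-call override from erasing a default, and the function builds a fresh object on every call so the input layers are never mutated.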

Implements the punch list from the reliability-research review:

- Anthropic: tests for the unpaired tool_use → is_error tool_result shim.
- Recovery state: per-pipeline clone + deterministic merge to eliminate the read-modify-write race in parallel Delegate batches.
- ChatOptions.thinking tri-state (true/false/'auto') with resolver, ThinkingNotSupportedError, and inline <think>/[THINK] discard path for Ollama's leak case. Tag-parsing helpers moved to src/utils/ to honour the providers→core architecture boundary.
- Ollama native↔prompt auto-downgrade: per-model tool-mode cache via getModelInfo, prompt-mode preamble + history downgrade (tool→user, assistant.tool_calls→<tool_call> JSON).
- Typed Nudge.meta so observability events no longer regex-parse the rendered template.
- resolveSampling rejects instance-level seed (per-call only per §17) and gains immutability tests.
- Threshold-warning re-fire integration test across turns.
- Comment clarification on per-call vs batch-level hard-error reset.

Unit suite: 1289 → 1320 tests, all passing.
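The prompt-mode history downgrade (tool→user, assistant.tool_calls→<tool_call> JSON) might be sketched as below. The `ChatMessage` and `ToolCall` shapes, the `downgradeHistory` name, and the "Tool result:" prefix are assumptions made for illustration; the PR does not show the actual types or wording.

```typescript
// Hypothetical message shapes for illustration.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  toolCalls?: ToolCall[];
}

// Rewrites a native tool-calling history into plain text turns for a model
// that only supports prompt-mode tool calling. Returns new objects; the
// input history is left untouched.
function downgradeHistory(history: ChatMessage[]): ChatMessage[] {
  return history.map((msg): ChatMessage => {
    // tool -> user: the model sees tool output as an ordinary user turn.
    if (msg.role === "tool") {
      return { role: "user", content: `Tool result:\n${msg.content}` };
    }
    const calls = msg.toolCalls ?? [];
    // assistant.tool_calls -> inline <tool_call> JSON in the text content.
    if (msg.role === "assistant" && calls.length > 0) {
      const rendered = calls
        .map(
          (c) =>
            `<tool_call>${JSON.stringify({ name: c.name, arguments: c.arguments })}</tool_call>`,
        )
        .join("\n");
      const content = msg.content ? `${msg.content}\n${rendered}` : rendered;
      return { role: "assistant", content };
    }
    return msg;
  });
}
```

Rendering past calls in the same <tool_call> form that the prompt-mode preamble asks the model to emit keeps the history consistent with the instructions the model is currently following.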

Splits the tool-mode/sampling/thinking branches out of `chat` and `chatNoStream` into a private `buildChatRequest`, and the chunk→ChatChunk shaping into a free `mapOllamaChunk`. Brings both methods back under the eslint complexity cap (25) after the prompt-mode auto-downgrade landed.

The strict test tsconfig (tsconfig.test.json) enforces all fields on ToolDefinition; the unit-only `tsx --test` run elides this check, so the missing discriminator only surfaced in CI.

Punch list from the second branch review:

- anthropic: emit a `tool-result-orphan` diagnostic provider-log entry when a `tool_result.tool_call_id` doesn't match any pending `tool_use` in `splitMessagesForAnthropic`. Anthropic 400s on that shape anyway; the log shortens the diagnostic path from "opaque 400" to "exact id that went unmatched".
- text-tool-parser: case-insensitive tool-name matching with canonicalisation. Small models routinely lowercase ("read" vs "Read"); the validator's unknown-tool check and the registry's `get()` are both case-insensitive, but the parser was stricter and silently dropped lowercase calls before either could see them. Now the parser does the same case-fold and rewrites the name to the registered canonical form so downstream consumers (which assume exact case) stay happy. Applied to the <tool_call> tag, Hermes-style <function=...> tag, and bare-JSON fallback paths.
- nudges: remove unused `RESPOND_TOOL` re-export (no callers).
- docs: replace the misleading "Aim: reproduce this on our own stack" preface in next-steps.md with a status block clarifying that the file is upstream research notes from the IEEE preprint + Python framework, not the TS implementation. Add a small section→file map so readers know which `src/` paths actually implement each §N anchor referenced from doc-comments.
- doc-comment refs: normalise two shorthand `next-steps.md §N` comments to the full `docs/reliability/next-steps.md §N` path (shared.ts, think-tags.ts) so every reference in `src/` uses the same form.

Tests (+8): five new parser cases covering lowercase tag/bare/Hermes-tag and the still-rejects-truly-unknown guarantee; three new cases under `runToolCalls — parallel Delegate batches > StepEnforcer + parallel Delegate` pinning that the shared StepEnforcer correctly records all successful siblings (and skips failed ones) while recovery state is cloned per-pipeline.

Unit suite: 1324 → 1332 tests, all passing.
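The case-fold plus canonicalisation step described for the text-tool-parser can be sketched as a lookup keyed by lowercased name. The function names `buildCanonicalIndex` and `canonicaliseToolName` are hypothetical, and the sketch assumes no two registered tools differ only by case.

```typescript
// Builds a lowercase -> registered-name index once, at registry-build time.
// Assumption: registered names are unique under case folding.
function buildCanonicalIndex(registeredNames: string[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const name of registeredNames) {
    index.set(name.toLowerCase(), name);
  }
  return index;
}

// Returns the registered spelling for a model-emitted tool name, or
// undefined when the tool is genuinely unknown (so it is still rejected).
function canonicaliseToolName(
  raw: string,
  index: Map<string, string>,
): string | undefined {
  return index.get(raw.trim().toLowerCase());
}

// Usage: a small model emits "read"; downstream code receives "Read".
const toolIndex = buildCanonicalIndex(["Read", "Write", "Delegate"]);
const resolved = canonicaliseToolName("read", toolIndex);
```

Rewriting to the registered spelling at parse time means consumers that compare names with exact case need no changes, while unknown names still fall through to the validator's unknown-tool handling.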

- format: prettier --write across the branch's reliability surface (docs/, src/core/agent/, src/providers/, tests). Pure whitespace / line-wrapping; the codebase's prettier config is the source of truth.
- knip: drop `export` from five symbols that were never imported externally: ReliabilityError (still extended by the three subclasses in the same file), DEFAULT_KEEP_RECENT (sole caller is in the same file), NudgeKind + NudgeMeta (used inside `Nudge`), and ThinkExtractResult (return type of an exported function; callers infer it). Also drop the stale `type ThinkExtractResult` re-export from reasoning.ts.
- run-agent: prettier reformat bumped `runAgent` to 306 lines, over the eslint max-lines-per-function cap (300). Extract `buildStepEnforcer(options, toolRegistry)` to bring it back under; pure behaviour-preserving extraction of the conditional StepEnforcer construction block.

vilaca force-pushed the reliability-research branch 9 times, most recently from 756e6ef to ef71b59 on May 23, 2026 01:21

vilaca added 13 commits May 23, 2026 18:23

docs: add small-model reliability research writeup

c98d40b

docs: add paper findings companion doc

7061fa8

test(ollama-prompt-mode): add required type: 'function' discriminator

df9c575

vilaca force-pushed the reliability-research branch from c6a7bc0 to eac4f24 on May 23, 2026 17:24

vilaca merged commit 0bcf942 into main May 23, 2026
11 of 12 checks passed

vilaca deleted the reliability-research branch May 23, 2026 17:26

Reliability research #7
vilaca merged 13 commits into main from reliability-research

vilaca commented May 22, 2026

1 participant