Reliability research #7
Merged
Deep-dive notes on a reliability stack for self-hosted LLM tool calling — guardrails, compaction, synthetic respond tool, ablation, and eval significance — captured as next-steps.md for replication on our stack.
Empirical results from the IEEE preprint — compounding-error math, 8B+framework parity with frontier, backend hidden-variable swings, compaction strategy table, and replication takeaways. Companion to next-steps.md.
…l framework, and BFCL integration
Three additions from a fuller read of the ADR directory: the _resolve_service lesson on why strict backends need a soft-error type, the five-tier diagnostic eval framework (lambda → ablated lambda → stateful → ablated stateful → stateful+strict), the information_loss future scenario concept, and the BFCL integration architecture as external validation.
…on caveat, and related-work section
From the author's HN-thread discussion: per-guardrail impact ranking (retry nudges 24-49pt, error recovery ~10pt, rescue + compaction no eval signal but kept for production), 97-config coverage figure, and comparison to Instructor / LangChain / DSPy / Outlines / tool_choice=any explaining why they stack rather than substitute.
…and JSONL row schema
Three operational additions:
- --reasoning-budget 0 workaround for post-2026-04-10 llama.cpp builds on
reasoning models (silent hangs).
- Production observability wiring patterns for on_message / on_compact /
on_chunk callbacks with logger + Prometheus + alerting examples.
- The JSONL row schema for resumable batch eval, including the per-run
identity/outcome/mechanics/history fields plus the rig provenance field for
multi-rig datasets.
…rdrails)
Implements the in-loop reliability guardrails from the IEEE preprint / next-steps
specs so cheaper, smaller models reach near-frontier completion rates on
multi-step tool workflows. Auto-enabled for weak-tier models and any model in the
sampling-defaults table; frontier models pay near-zero cost.
What's new (each gated by the auto-enable rule in reliability-config.ts):
- Synthetic Respond tool — gives small models a structured terminal action so
the loop never has to disambiguate text vs. tool call. Auto-injected on the
wire only for weak-tier / known small-model deployments; single-Respond
batches short-circuit to text-done.
- Message tagging — every conversation append carries a MessageType
(system_prompt / user_input / tool_call / tool_result / reasoning /
text_response / step_nudge / prerequisite_nudge / retry_nudge /
context_warning / summary). Stripped at the wire boundary.
- Tiered 3-phase compaction — deterministic text manipulation, no LLM call.
P1 drops nudges + truncates tool_results to 200 chars; P2 also drops
tool_results; P3 drops reasoning + text_response. keep_recent=2 iteration
boundaries. LLM-summary path stays as a Phase-4 emergency fallback.
- ResponseValidator + Nudge — stateless validator emits structured retry /
unknown-tool / step / prerequisite nudges with three-tier escalation for
premature-terminal attempts.
- StepEnforcer + tool prerequisites — opt-in via requiredSteps / terminalTools
on AgentOptions; declared ToolDefinition.prerequisites validated at
registry-build time. StepEnforcementError / PrerequisiteError on exhaustion.
- ToolResolutionError — new tool-author exception type ("valid request, no
data") with softError tagging; the hard-error counter only bumps on
non-resolution throws, letting models fumble 8+ wrong-key lookups within
the iteration budget while still bailing on real bugs.
- Context threshold warnings — once-per-session-per-threshold transient
"context filling up" / "nearly full" injection at 65% / 80% with re-arm on
drop.
- Reasoning fold + think-tag utilities ([THINK] / <think>) for future wire
serialization.
- Per-model sampling defaults map (Ministral / Qwen3 / Granite 4 / Gemma 4)
with strict / non-strict policy.
- Per-call sampling overrides on every provider (Ollama, Anthropic, all
OpenAI-compat via shared adapter). New ChatOptions fields: topP, topK,
minP, repeatPenalty, presencePenalty, recommendedSampling.
- Anthropic synthetic tool_result for unpaired tool_use (load-bearing for the
step/prereq nudge path) + tool_choice="any" exposed via
ChatOptions.forceToolCall (auto-enabled on weak-tier Anthropic models).
- Backend context discovery via API — Ollama /api/show, llama-server /props.
- Observability events — step-nudge, prerequisite-nudge, step-completed,
context-warning, respond-stripped; compaction event now carries
phase: 0|1|2|3|4.
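To make the tiered compaction bullet concrete, here is a minimal sketch of the three deterministic phases. It is not the code from this PR: `TaggedMessage`, `compactPhase`, `tieredCompact`, the `iteration` field, and the `fits` callback are hypothetical names, the set of types treated as "nudges" is an assumption, and tool_call ↔ tool_result pairing is deliberately ignored.

```typescript
// Hypothetical types: the tag names mirror the MessageType list above,
// everything else is invented for illustration.
type MessageType =
  | "system_prompt" | "user_input" | "tool_call" | "tool_result"
  | "reasoning" | "text_response" | "step_nudge" | "prerequisite_nudge"
  | "retry_nudge" | "context_warning" | "summary";

interface TaggedMessage {
  type: MessageType;
  content: string;
  iteration: number; // assumed: agent-loop iteration that appended the message
}

// Assumption: "nudges" means the three *_nudge tags.
const NUDGES: ReadonlySet<MessageType> = new Set<MessageType>([
  "step_nudge", "prerequisite_nudge", "retry_nudge",
]);

function compactPhase(
  messages: TaggedMessage[],
  phase: 1 | 2 | 3,
  keepRecent = 2,
): { messages: TaggedMessage[]; changed: boolean } {
  const latest = messages.reduce((max, m) => Math.max(max, m.iteration), 0);
  const cutoff = latest - keepRecent; // iterations above the cutoff are protected
  let changed = false;
  const out: TaggedMessage[] = [];
  for (const m of messages) {
    if (m.iteration > cutoff) { out.push(m); continue; }
    // P1 and later: drop nudges.
    if (NUDGES.has(m.type)) { changed = true; continue; }
    if (m.type === "tool_result") {
      // P2 and later: drop old tool results entirely.
      if (phase >= 2) { changed = true; continue; }
      // P1: truncate old tool results to 200 characters.
      if (m.content.length > 200) {
        out.push({ ...m, content: m.content.slice(0, 200) });
        changed = true;
        continue;
      }
    }
    // P3: drop old reasoning and text responses.
    if (phase >= 3 && (m.type === "reasoning" || m.type === "text_response")) {
      changed = true;
      continue;
    }
    out.push(m);
  }
  return { messages: out, changed };
}

// Escalate through P1..P3 until the caller-supplied budget check passes.
function tieredCompact(
  messages: TaggedMessage[],
  fits: (m: TaggedMessage[]) => boolean,
): { messages: TaggedMessage[]; changed: boolean; phase: 0 | 1 | 2 | 3 } {
  let current = messages;
  let changed = false;
  let phase: 0 | 1 | 2 | 3 = 0;
  for (const p of [1, 2, 3] as const) {
    if (fits(current)) break;
    const result = compactPhase(current, p);
    current = result.messages;
    changed = changed || result.changed; // running OR across phases
    phase = p;
  }
  return { messages: current, changed, phase };
}
```

The running OR on `changed` means a mutation in an early phase is still reported when a later phase turns out to be a no-op. A real implementation would also need to keep each tool_call paired with its tool_result when P2 drops results.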
Out of scope (API-only constraint): ServerManager / launch flags / nvidia-smi
/ multi-slot / proxy server / ablation framework / eval harness.
Test coverage: 1260 unit tests (+80) across 11 new test files. Full plan in
~/.claude/plans/recursive-stargazing-truffle.md.
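The context threshold warnings listed above (fire once per threshold, re-arm when usage drops back below it) can be sketched as a small tracker. This is an illustration under assumptions, not the PR's code: `ContextWarningTracker` and `check` are hypothetical names, and returning only the highest newly crossed threshold is a design choice made here.

```typescript
// Hypothetical helper: tracks which usage thresholds have already fired.
class ContextWarningTracker {
  private readonly thresholds: number[];
  private readonly fired = new Set<number>();

  constructor(thresholds: readonly number[] = [0.65, 0.8]) {
    // Ascending order so the last newly crossed threshold is the highest.
    this.thresholds = [...thresholds].sort((a, b) => a - b);
  }

  // Returns the highest threshold newly crossed by this reading, or null.
  check(usedTokens: number, contextWindow: number): number | null {
    const ratio = usedTokens / contextWindow;
    let crossed: number | null = null;
    for (const t of this.thresholds) {
      if (ratio < t) {
        // Usage is below this threshold: re-arm it for a later crossing.
        this.fired.delete(t);
        continue;
      }
      if (!this.fired.has(t)) {
        this.fired.add(t);
        crossed = t;
      }
    }
    return crossed;
  }
}

// Usage: a caller would inject a transient warning only when check() is non-null.
const tracker = new ContextWarningTracker();
const firstReading = tracker.check(660, 1000); // crosses 65%
const repeatReading = tracker.check(700, 1000); // still above 65%, no repeat
```

Re-arming on drop matters after compaction: once usage falls below a threshold, the next climb past it produces a fresh warning instead of staying silent for the rest of the session.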
CI lint flagged five rules after the reliability-stack commit. All
behavior-preserving extractions:
- run-agent.ts: split the runAgent generator under the 300-line cap by
extracting fireUserPromptSubmit, handleNoToolCallsBranch,
detectRespondShortCircuit, emitEnforcerObservability, resolveChainPointer,
captureChainPointer, maybeInjectContextWarning, settleCleanBatch,
isHardErrorBudgetExhausted, and hasAnyPrereqs. Replaced an inline import()
type annotation with the proper type-import (consistent-type-imports rule).
- shared.ts: pulled resolveSampling's repetitive conditional ladders into
mergeDefaults / mergeOverrides helpers backed by a single
ChatOptions→ResolvedSampling field map; complexity drops well below the cap.
- anthropic.ts: extracted anthropicExtras(sampling, forceToolCall, hasTools)
so both the streaming and non-streaming params builders stay under the
per-method complexity cap.
- ollama.ts: extracted ollamaOptions(model, maxTokens, sampling) used by both
chat() and chatNoStream(); same intent.
Tests still green (1260 unit + 48 e2e, 4 PTY skipped). No public API changes.
Applies fixes from the branch review at
~/.claude/plans/humming-bubbling-lollipop.md.
- providers: thread `providerName` through buildChatBody so the
per-model sampling-defaults diagnostic INFO line fires for OpenAI-
compat providers too (openai, cerebras, llamacpp, groq, openrouter,
vercel, mistral, workersai, copilot, googleaistudio, opencodezen).
Defaults already applied silently — only the log was missing.
- docs: move research notes off the repo root.
next-steps.md → docs/reliability/next-steps.md
next_steps_paper.md → docs/reliability/paper-findings.md
Updates 19 in-source `next-steps.md §N` references to the new path.
- tiered-compact: `changed` is now cumulative across P1/P2/P3 via a
running OR. Previously returned only the final phase's delta, so
a P1-only mutation followed by no-op P2 falsely reported
`changed=false` to callers (logging-only impact, but misleading).
- anthropic: flush `pendingToolUseIds` on *any* plain-message
boundary in `splitMessagesForAnthropic`, not just user-role. The
conversation grammar never produces consecutive assistant runs
today, but the guard hardens against future regressions (summary
injection, mid-turn rewrites) that could leak unpaired tool_use
to the wire — Anthropic 400s on that shape.
- tools/types: codify the `ToolResult` flag matrix as a doc comment.
Enumerates the 7 reachable combinations and the mutual-exclusion
rules (softError ⊕ hardError, both imply success=false, etc.).
- tests (+23): three new suites/sections.
* resolveSampling three-tier merge chain (8 cases) — instance
defaults → per-model table → per-call overrides precedence,
camel→snake field mapping, recommendedSampling on unknown
model is a no-op.
* autoEnableForModel tier gating (5 cases) — weak/medium/strong
paths, sampling-profile override of tier, forceToolCall only
fires for weak-tier Anthropic.
* tiered-compact (3 cases) — cumulative `changed` after multi-
phase run, tool_call ↔ tool_result pairing across P2, metadata
preservation through all phases.
No public API changes (buildChatBody gains an optional field).
1283 unit tests pass (+23). No production behavior change beyond
the diagnostic-log and Anthropic flush hardening above.
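The three-tier merge chain exercised by the new resolveSampling tests can be illustrated with a small sketch. Everything below is an assumption made for illustration: `resolveSamplingSketch`, `FIELD_MAP`, and `MODEL_DEFAULTS` are hypothetical names, the model entry and its values are placeholders, and treating `recommendedSampling` as a boolean that opts in to the per-model table is a guess at its semantics.

```typescript
// Hypothetical camelCase option shape, based on the ChatOptions fields listed earlier.
interface SamplingOptions {
  temperature?: number;
  topP?: number;
  topK?: number;
  minP?: number;
  repeatPenalty?: number;
  presencePenalty?: number;
}

// Single camel→snake field map shared by every merge tier.
const FIELD_MAP: Record<keyof SamplingOptions, string> = {
  temperature: "temperature",
  topP: "top_p",
  topK: "top_k",
  minP: "min_p",
  repeatPenalty: "repeat_penalty",
  presencePenalty: "presence_penalty",
};

// Placeholder table: the model name and values are made up for this sketch.
const MODEL_DEFAULTS: Record<string, SamplingOptions> = {
  "example-small-model": { temperature: 0.7, topP: 0.8, topK: 20 },
};

function resolveSamplingSketch(
  model: string,
  instanceDefaults: SamplingOptions,
  perCall: SamplingOptions,
  recommendedSampling: boolean,
): Record<string, number> {
  // An unknown model has no table entry, so the flag becomes a no-op.
  const modelDefaults: SamplingOptions = recommendedSampling
    ? (MODEL_DEFAULTS[model] ?? {})
    : {};
  const out: Record<string, number> = {};
  // Later tiers win: instance defaults → per-model table → per-call overrides.
  for (const tier of [instanceDefaults, modelDefaults, perCall]) {
    for (const key of Object.keys(FIELD_MAP) as (keyof SamplingOptions)[]) {
      const value = tier[key];
      // Skip undefined so an unset field never erases an earlier tier.
      if (value !== undefined) out[FIELD_MAP[key]] = value;
    }
  }
  return out;
}

// Usage: per-call topP beats the table, the table beats instance temperature.
const resolved = resolveSamplingSketch(
  "example-small-model",
  { temperature: 0.2, minP: 0.05 },
  { topP: 0.95 },
  true,
);
```

The inputs are never mutated; each call builds a fresh output object, which keeps shared instance defaults safe across concurrent requests.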
Implements the punch list from the reliability-research review:
- Anthropic: tests for the unpaired tool_use → is_error tool_result shim.
- Recovery state: per-pipeline clone + deterministic merge to eliminate the
read-modify-write race in parallel Delegate batches.
- ChatOptions.thinking tri-state (true/false/'auto') with resolver,
ThinkingNotSupportedError, and inline <think>/[THINK] discard path for
Ollama's leak case. Tag-parsing helpers moved to src/utils/ to honour the
providers→core architecture boundary.
- Ollama native↔prompt auto-downgrade: per-model tool-mode cache via
getModelInfo, prompt-mode preamble + history downgrade (tool→user,
assistant.tool_calls→<tool_call> JSON).
- Typed Nudge.meta so observability events no longer regex-parse the rendered
template.
- resolveSampling rejects instance-level seed (per-call only per §17) and
gains immutability tests.
- Threshold-warning re-fire integration test across turns.
- Comment clarification on per-call vs batch-level hard-error reset.
Unit suite: 1289 → 1320 tests, all passing.
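The prompt-mode history downgrade mentioned above (tool→user, assistant.tool_calls→<tool_call> JSON) could look roughly like the sketch below. It assumes a simplified message shape and an invented wording for the tool-result prefix; `ChatMessage`, `ToolCall`, and `downgradeHistory` are hypothetical names, not the PR's actual types.

```typescript
// Hypothetical, simplified message shapes for illustration only.
interface ToolCall {
  id: string;
  name: string;
  arguments: Record<string, unknown>;
}

interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  tool_calls?: ToolCall[];
  tool_call_id?: string;
}

// Rewrite native tool-calling history so a prompt-mode model can read it.
function downgradeHistory(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((m): ChatMessage => {
    if (m.role === "tool") {
      // tool → user: the result becomes plain user-visible text.
      // The prefix wording is an assumption of this sketch.
      return {
        role: "user",
        content: `Tool result (${m.tool_call_id ?? "unknown"}):\n${m.content}`,
      };
    }
    if (m.role === "assistant" && m.tool_calls && m.tool_calls.length > 0) {
      // assistant.tool_calls → inline <tool_call> JSON blocks.
      const calls = m.tool_calls
        .map((c) =>
          `<tool_call>${JSON.stringify({ name: c.name, arguments: c.arguments })}</tool_call>`)
        .join("\n");
      return {
        role: "assistant",
        content: m.content ? `${m.content}\n${calls}` : calls,
      };
    }
    return m; // system, user, and plain assistant messages pass through
  });
}

// Usage: one native tool round-trip downgraded to prompt-mode text.
const downgraded = downgradeHistory([
  { role: "user", content: "Show a.txt" },
  {
    role: "assistant",
    content: "",
    tool_calls: [{ id: "call_1", name: "Read", arguments: { path: "a.txt" } }],
  },
  { role: "tool", content: "hello", tool_call_id: "call_1" },
]);
```

Rendering past calls in the same `<tool_call>` format the prompt-mode preamble asks for keeps the history consistent with what the model is expected to emit next.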
Splits the tool-mode/sampling/thinking branches out of `chat` and `chatNoStream` into a private `buildChatRequest`, and the chunk→ChatChunk shaping into a free `mapOllamaChunk`. Brings both methods back under the eslint complexity cap (25) after the prompt-mode auto-downgrade landed.
The strict test tsconfig (tsconfig.test.json) enforces all fields on ToolDefinition; the unit-only `tsx --test` run elides this check, so the missing discriminator only surfaced in CI.
Punch list from the second branch review:
- anthropic: emit a `tool-result-orphan` diagnostic provider-log entry
when a `tool_result.tool_call_id` doesn't match any pending
`tool_use` in `splitMessagesForAnthropic`. Anthropic 400s on that
shape anyway; the log shortens the diagnostic path from "opaque
400" to "exact id that went unmatched".
- text-tool-parser: case-insensitive tool-name matching with
canonicalisation. Small models routinely lowercase ("read" vs
"Read"); the validator's unknown-tool check and the registry's
`get()` are both case-insensitive, but the parser was stricter and
silently dropped lowercase calls before either could see them.
Now the parser does the same case-fold and rewrites the name to
the registered canonical form so downstream consumers (which
assume exact case) stay happy. Applied to the <tool_call> tag,
Hermes-style <function=...> tag, and bare-JSON fallback paths.
- nudges: remove unused `RESPOND_TOOL` re-export (no callers).
- docs: replace the misleading "Aim: reproduce this on our own
stack" preface in next-steps.md with a status block clarifying
that the file is upstream research notes from the IEEE preprint
+ Python framework, not the TS implementation. Add a small
section→file map so readers know which `src/` paths actually
implement each §N anchor referenced from doc-comments.
- doc-comment refs: normalise two shorthand `next-steps.md §N`
comments to the full `docs/reliability/next-steps.md §N` path
(shared.ts, think-tags.ts) so every reference in `src/` uses the
same form.
Tests (+8): five new parser cases covering lowercase tag/bare/
Hermes-tag and the still-rejects-truly-unknown guarantee; three
new cases under `runToolCalls — parallel Delegate batches >
StepEnforcer + parallel Delegate` pinning that the shared
StepEnforcer correctly records all successful siblings (and skips
failed ones) while recovery state is cloned per-pipeline.
Unit suite: 1324 → 1332 tests, all passing.
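The parser fix described above (case-fold the name, then rewrite it to the registered canonical form) can be sketched for the <tool_call> tag path as follows. `ToolNameIndex` and `parseToolCallTag` are hypothetical names, and the JSON payload shape (`name` plus `arguments`) is an assumption; the Hermes-style and bare-JSON paths are not shown.

```typescript
// Hypothetical lookup from lowercased name to the registered canonical name.
class ToolNameIndex {
  private readonly byFolded = new Map<string, string>();

  constructor(registeredNames: readonly string[]) {
    for (const name of registeredNames) {
      this.byFolded.set(name.toLowerCase(), name);
    }
  }

  // Returns the canonical name, or null when the tool is truly unknown.
  canonicalise(raw: string): string | null {
    return this.byFolded.get(raw.trim().toLowerCase()) ?? null;
  }
}

interface ParsedToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Parses the first <tool_call>{...}</tool_call> block in a text response.
function parseToolCallTag(text: string, index: ToolNameIndex): ParsedToolCall | null {
  const match = /<tool_call>([\s\S]*?)<\/tool_call>/.exec(text);
  if (!match) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(match[1] ?? "");
  } catch {
    return null; // malformed JSON is not a tool call
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const { name, arguments: args } = parsed as { name?: unknown; arguments?: unknown };
  if (typeof name !== "string") return null;
  const canonical = index.canonicalise(name);
  if (canonical === null) return null; // still rejects truly unknown tools
  return {
    name: canonical, // downstream consumers see the exact registered case
    arguments: (typeof args === "object" && args !== null ? args : {}) as Record<string, unknown>,
  };
}

// Usage: a lowercase call from a small model resolves to the canonical "Read".
const toolIndex = new ToolNameIndex(["Read", "Write"]);
const lowercaseCall = parseToolCallTag(
  '<tool_call>{"name":"read","arguments":{"path":"a.txt"}}</tool_call>',
  toolIndex,
);
```

Doing the rewrite in the parser, rather than loosening every downstream comparison, keeps the case-insensitivity in one place.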
- format: prettier --write across the branch's reliability surface (docs/,
src/core/agent/, src/providers/, tests). Pure whitespace / line-wrapping;
the codebase's prettier config is the source of truth.
- knip: drop `export` from five symbols that were never imported externally:
ReliabilityError (still extended by the three subclasses in the same file),
DEFAULT_KEEP_RECENT (sole caller is in the same file), NudgeKind + NudgeMeta
(used inside `Nudge`), and ThinkExtractResult (return type of an exported
function; callers infer it). Also drop the stale `type ThinkExtractResult`
re-export from reasoning.ts.
- run-agent: prettier reformat bumped `runAgent` to 306 lines, over the
eslint max-lines-per-function cap (300). Extract
`buildStepEnforcer(options, toolRegistry)` to bring it back under; pure
behaviour-preserving extraction of the conditional StepEnforcer construction
block.