[AI-9] Phase 3: daemon consumes eval catalog + posts V3 results#200

Merged
realtonyyoung merged 9 commits into main from tonyyoung/ai-9-phase-3-daemon-catalog on Jun 30, 2026

@realtonyyoung

Contributor

AI-9 Phase 3 (kcap-cli) — daemon consumes the eval catalog + posts V3 results

The eval daemon no longer ships hardcoded judge prompts. It fetches the server's runtime catalog, reconciles the run against it, and posts versioned (V3) results, so after this one-time bump, prompt edits made in the admin catalog take effect without a new CLI release.

What's here (7 commits)

  • DTOs + V3 payload (Models.cs): EvalCatalogDto/EvalCatalogQuestionDto, additive nullable prompt_version on EvalQuestionDto/EvalQuestionVerdict, SessionEvalCompletedPayloadV3 — all source-gen registered (AOT-clean).
  • EvalCatalogClient: fetches GET /api/eval/catalog with fail-fast validation (null → run aborts).
  • EvalService reconciliation + V3 posting: PrepareAsync reconciles the run question list from the catalog (rendered prompt + raw text + version + needs_tools, in selected order); Aggregate stamps each verdict's prompt_version; FinalizeAsync posts V3 to /evals/v3, filling the retrospective {TRACE_JSON} from the daemon's trace; IEvalObserver/EvalCommand/daemon EvalRunner flipped to V3.
  • WireMock integration tests: V3 post wire-shape, catalog fetch/reconcile e2e, tools routing (the four needs_tools ids), and an alias raw-text double-wrap guard.
  • Cleanup: dropped the now-dead embedded text-question + retrospective prompts (tools wrapper kept).
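
The reconciliation step listed above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the `CatalogQuestion` record, its field types (including `int?` for the version), and the choice to return null when a selected id is missing from the catalog are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for illustration only; the real DTOs live in Models.cs.
public sealed record CatalogQuestion(
    string Id,
    string RenderedPrompt,
    string QuestionText,
    int? PromptVersion,   // assumed type; the PR only says "nullable prompt_version"
    bool NeedsTools);

public static class ReconcileSketch
{
    // Returns the catalog entries for the selected ids, in selected order.
    // Assumption: an id missing from the catalog fails the whole reconcile (null).
    // Note: ToDictionary throws if the catalog contains duplicate ids.
    public static IReadOnlyList<CatalogQuestion>? Reconcile(
        IReadOnlyList<string> selectedIds,
        IReadOnlyList<CatalogQuestion> catalog)
    {
        var byId = catalog.ToDictionary(q => q.Id, StringComparer.Ordinal);
        var result = new List<CatalogQuestion>(selectedIds.Count);

        foreach (var id in selectedIds)
        {
            if (!byId.TryGetValue(id, out var question)) return null;
            result.Add(question);
        }

        return result;
    }
}
```

Keeping the selected order (rather than the catalog's order) means each verdict can later be stamped with the `PromptVersion` of the reconciled item it was judged against.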

Notes

  • AOT: source-gen JSON only; dotnet publish clean (0 IL2026/IL3050).
  • Tools path keeps the embedded prompt-eval-question-tools.txt (catalog has no tools template — follow-up tracked).
  • Built/tested against WireMock (Phase-2-independent); the live end-to-end is covered in the paired kcap-server PR.
  • Reviewed via subagent-driven development (per-task spec+quality reviews + a final whole-branch review: ready to merge).

Merge order: this kcap-cli PR merges FIRST; the paired kcap-server PR then re-points its src/cli submodule at the merged commit before it merges.

🤖 Generated with Claude Code

realtonyyoung and others added 7 commits June 29, 2026 17:49
…ayload

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ith fail-fast validation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…romptVersion stamping, retro TRACE_JSON)

PrepareAsync now reconciles the run question list from the fetched
EvalCatalogDto (rendered prompt + raw text + prompt_version + needs_tools,
in selected order) instead of loading embedded templates. The text path
uses each question's server-rendered prompt directly (stripping any residual
{CACHE_BOUNDARY}); the tools path keeps the embedded wrapper, substituting
the catalog raw question_text. Aggregate stamps each verdict's PromptVersion
from the reconciled questions and now returns SessionEvalCompletedPayloadV3.

FinalizeAsync builds V3, stamps RetrospectivePromptVersion from
ctx.RetrospectivePromptVersion, fills the retrospective {TRACE_JSON} from the
daemon's already-fetched trace (SF#1), and posts via the new
PersistAggregateV3Async to /evals/v3. RunAsync/FinalizeAsync return V3;
EvalCommand.Render and IEvalObserver.OnFinished flip to V3 (all impls updated:
SafeObserver, ConsoleEvalObserver, DaemonEvalObserver, and three test observers).
RunAsync + the daemon's HandlePrepareAsync fetch the catalog and iterate the
reconciled questions; HandleRunQuestionAsync judges the reconciled item by id.

New public seams ReconcileQuestions + BuildTextQuestionPrompt are unit-tested
(EvalServicePromptVersionTests). The legacy text-path BuildQuestionPrompt is
removed and BuildRetrospectivePrompt takes the catalog template + trace.
Embedded prompt-eval-question.txt / prompt-eval-retrospective.txt are no longer
loaded (resource files left for Task 10 to delete).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…alog fetch/reconcile, tools routing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@qodo-code-review


Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@linear-code

linear-code Bot commented Jun 29, 2026


AI-9

Comment thread src/Capacitor.Cli.Core/Eval/EvalCatalogClient.cs Outdated
…y / null item (review)

A JSON `"questions":null` overrides the [] initializer (NRE on .Count) and a
`[null]` element NREs on field access — neither caught by the HttpRequestException/
JsonException handlers, so the run crashed instead of failing closed. Add explicit
null guards (return null + OnFailed). + 2 regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
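
The failure mode in the commit message above is easy to reproduce with System.Text.Json: an explicit JSON `null` replaces a property's `[]` initializer, and a `[null]` array deserializes to a list holding a null element. The sketch below shows one way such guards could look. The `CatalogSketch` types and `ParseOrNull` are hypothetical, and it uses reflection-based deserialization for brevity, whereas the PR states it uses source-generated JSON.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical DTOs for illustration; not the real EvalCatalogDto.
public sealed class CatalogQuestionSketch
{
    [JsonPropertyName("id")]
    public string? Id { get; set; }
}

public sealed class CatalogSketch
{
    // The initializer does NOT protect against an explicit "questions": null.
    [JsonPropertyName("questions")]
    public List<CatalogQuestionSketch?>? Questions { get; set; } = new();
}

public static class CatalogGuardSketch
{
    // Returns null (fail closed) instead of letting a NullReferenceException escape.
    public static CatalogSketch? ParseOrNull(string json)
    {
        CatalogSketch? catalog;
        try
        {
            catalog = JsonSerializer.Deserialize<CatalogSketch>(json);
        }
        catch (JsonException)
        {
            return null; // malformed JSON
        }

        if (catalog?.Questions is null) return null; // body "null" or "questions": null

        foreach (var question in catalog.Questions)
        {
            // Covers a [null] element; the missing-id check is an extra assumption.
            if (question is null || string.IsNullOrEmpty(question.Id)) return null;
        }

        return catalog;
    }
}
```

In the PR the null return is paired with an `OnFailed` notification to the observer; that part is omitted here.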
@alexeyzimarev

Member

/agentic_review

@qodo-code-review


Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (1) 📜 Skill insights (0)

Action required

1. README missing eval V3 prerequisite 📘 Rule violation ⚙ Maintainability
Description
The eval flow now hard-requires `GET /api/eval/catalog` and persists results to
`POST /api/sessions/{id}/evals/v3`, which changes CLI prerequisites/behavior when run against older
servers. README.md’s `kcap eval` documentation is not updated to reflect this new server
requirement, violating the documentation update requirement for user-facing CLI surface changes.
Code

src/Capacitor.Cli.Core/Eval/EvalService.cs[R281-286]

+            // AI-9 Phase 3 — fetch the full catalog (rendered prompts + raw text +
+            // versions) so PrepareAsync can reconcile the run question list from it.
+            var catalog = await EvalCatalogClient.FetchAsync(baseUrl, httpClient, observer, ct);
+            if (catalog is null) return null;   // FetchAsync already emitted OnFailed
+
+            var ctx = await PrepareAsync(baseUrl, httpClient, sessionId, questions, catalog, chain, thresholdBytes, observer, ct, model, evalRunId);
Evidence
The PR introduces a new mandatory runtime dependency for kcap eval: it fetches the eval catalog
from /api/eval/catalog and aborts the run on failure, and it persists results to the V3 endpoint.
README.md’s Session evaluation (LLM-as-judge) section does not mention this new server
prerequisite/compatibility requirement, so the user-facing CLI docs are not updated alongside the
behavior change.

CLAUDE.md: Update README.md in the same PR for any user-facing CLI surface changes
src/Capacitor.Cli.Core/Eval/EvalService.cs[281-286]
src/Capacitor.Cli.Core/Eval/EvalCatalogClient.cs[21-28]
src/Capacitor.Cli.Core/Eval/EvalService.cs[689-705]
README.md[241-257]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`kcap eval` now depends on new server endpoints (`/api/eval/catalog` and `/api/sessions/{id}/evals/v3`) and fails fast if the catalog endpoint is unavailable. This is a user-facing prerequisite change (server compatibility) but README.md was not updated in this PR.

## Issue Context
- The CLI now aborts eval runs when `GET /api/eval/catalog` fails.
- The CLI now persists eval results to the V3 route.
- README.md currently documents `kcap eval` behavior without mentioning the new server requirement/compatibility expectation.

## Fix Focus Areas
- src/Capacitor.Cli.Core/Eval/EvalService.cs[281-286]
- src/Capacitor.Cli.Core/Eval/EvalService.cs[689-705]
- src/Capacitor.Cli.Core/Eval/EvalCatalogClient.cs[21-28]
- README.md[241-257]


Remediation recommended

2. Retrospective re-embeds full trace 🐞 Bug ☼ Reliability
Description
RunRetrospectiveAsync fills the catalog retrospective template’s {TRACE_JSON} placeholder with
ctx.TraceJson, which can be large enough to overflow the judge model context and undermines the
existing “force tools when trace is big” strategy. For large traces this can make the retrospective
call fail (HTTP 400 / max turns) or become unnecessarily expensive, despite the retrospective
already having MCP tools for on-demand inspection.
Code

src/Capacitor.Cli.Core/Eval/EvalService.cs[R1211-1214]

+        // AI-9 Phase 3: template comes from the catalog and {TRACE_JSON} is
+        // filled with the daemon's already-fetched trace (SF#1).
+        var prompt = BuildRetrospectivePrompt(
+            retrospectivePrompt, sessionMeta, verdictsJson, knownPatterns: "", traceJson);
Evidence
EvalService explicitly documents that embedding {TRACE_JSON} can overflow context and uses a
token-budget gate to route judge calls through the tools path instead of embedding the trace.
However, the retrospective prompt builder now always replaces {TRACE_JSON} with the full
traceJson, reintroducing the large-prompt risk for retrospective synthesis.

src/Capacitor.Cli.Core/Eval/EvalService.cs[190-199]
src/Capacitor.Cli.Core/Eval/EvalService.cs[407-413]
src/Capacitor.Cli.Core/Eval/EvalService.cs[1205-1215]
src/Capacitor.Cli.Core/Eval/EvalService.cs[796-807]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The retrospective prompt is now built by replacing `{TRACE_JSON}` with the full compacted trace, reintroducing the prompt-size risk that `ShouldForceTools` was designed to avoid.

## Issue Context
- The code explicitly documents that embedding `{TRACE_JSON}` can overflow the judge model context window and routes **per-question** judging through the tools path when trace size exceeds a budget.
- The retrospective path also has MCP tools enabled, so it can function without embedding the full trace.

## Fix Focus Areas
- Gate the retrospective trace substitution based on the same size heuristic used for per-question routing (or reuse `ctx.ForceTools`). When the trace is “too large”, replace `{TRACE_JSON}` with an empty string (or a short marker) rather than embedding the full JSON.
- Keep behavior consistent with the intent documented in the surrounding comments.

- src/Capacitor.Cli.Core/Eval/EvalService.cs[190-218]
- src/Capacitor.Cli.Core/Eval/EvalService.cs[1205-1236]
- src/Capacitor.Cli.Core/Eval/EvalService.cs[796-807]
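
A minimal sketch of the gating suggested above, assuming a boolean equivalent to `ctx.ForceTools` is available where the retrospective prompt is built. `FillTrace` and the marker text are hypothetical names introduced for illustration; nothing in this thread confirms that the merged code adopted this approach.

```csharp
using System;

public static class RetrospectiveTraceSketch
{
    // Hypothetical marker; the review suggests an empty string or a short marker.
    private const string OmittedMarker =
        "[trace omitted: too large to embed, inspect it with the MCP tools]";

    // Embeds the trace only when it fits the budget; otherwise substitutes the
    // marker so the judge falls back to on-demand tool inspection.
    public static string FillTrace(string template, string traceJson, bool forceTools)
    {
        var replacement = forceTools ? OmittedMarker : traceJson;
        return template.Replace("{TRACE_JSON}", replacement, StringComparison.Ordinal);
    }
}
```

Reusing the per-question size decision keeps a single threshold for "trace too large" rather than introducing a second heuristic for the retrospective call.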

Comment thread src/Capacitor.Cli.Core/Eval/EvalService.cs
kcap eval now fetches its question catalog from GET /api/eval/catalog and
posts results to POST /api/sessions/{id}/evals/v3, failing fast against a
server that doesn't expose the catalog endpoint. Document this server
prerequisite in the Session evaluation section (Qodo review on #200).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@realtonyyoung merged commit 1862530 into main on Jun 30, 2026
5 checks passed
@realtonyyoung deleted the tonyyoung/ai-9-phase-3-daemon-catalog branch on June 30, 2026 14:16
2 participants