[AI-9] Phase 3: daemon consumes eval catalog + posts V3 results by realtonyyoung · Pull Request #200 · kurrent-io/kcap-cli

realtonyyoung · 2026-06-29T23:14:12Z

AI-9 Phase 3 (kcap-cli) — daemon consumes the eval catalog + posts V3 results

The eval daemon no longer ships hardcoded judge prompts. It fetches the server's runtime catalog, reconciles the run against it, and posts versioned (V3) results — so prompt edits land via the admin catalog with no CLI release after this one-time bump.

What's here (7 commits)

DTOs + V3 payload (Models.cs): EvalCatalogDto/EvalCatalogQuestionDto, additive nullable prompt_version on EvalQuestionDto/EvalQuestionVerdict, SessionEvalCompletedPayloadV3 — all source-gen registered (AOT-clean).
EvalCatalogClient: fetches GET /api/eval/catalog with fail-fast validation (null → run aborts).
EvalService reconciliation + V3 posting: PrepareAsync reconciles the run question list from the catalog (rendered prompt + raw text + version + needs_tools, in selected order); Aggregate stamps each verdict's prompt_version; FinalizeAsync posts V3 to /evals/v3, filling the retrospective {TRACE_JSON} from the daemon's trace; IEvalObserver/EvalCommand/daemon EvalRunner flipped to V3.
WireMock integration tests: V3 post wire-shape, catalog fetch/reconcile e2e, tools routing (the four needs_tools ids), and an alias raw-text double-wrap guard.
Cleanup: dropped the now-dead embedded text-question + retrospective prompts (tools wrapper kept).

Notes

AOT: source-gen JSON only; dotnet publish clean (0 IL2026/IL3050).
Tools path keeps the embedded prompt-eval-question-tools.txt (catalog has no tools template — follow-up tracked).
Built/tested against WireMock (Phase-2-independent); the live end-to-end is covered in the paired kcap-server PR.
Reviewed via subagent-driven development (per-task spec+quality reviews + a final whole-branch review: ready to merge).

Merge order: this kcap-cli PR merges FIRST; the paired kcap-server PR then re-points its src/cli submodule at the merged commit before it merges.

🤖 Generated with Claude Code

…romptVersion stamping, retro TRACE_JSON)

PrepareAsync now reconciles the run question list from the fetched EvalCatalogDto (rendered prompt + raw text + prompt_version + needs_tools, in selected order) instead of loading embedded templates. The text path uses each question's server-rendered prompt directly (stripping any residual {CACHE_BOUNDARY}); the tools path keeps the embedded wrapper, substituting the catalog raw question_text.

Aggregate stamps each verdict's PromptVersion from the reconciled questions and now returns SessionEvalCompletedPayloadV3. FinalizeAsync builds V3, stamps RetrospectivePromptVersion from ctx.RetrospectivePromptVersion, fills the retrospective {TRACE_JSON} from the daemon's already-fetched trace (SF#1), and posts via the new PersistAggregateV3Async to /evals/v3.

RunAsync/FinalizeAsync return V3; EvalCommand.Render and IEvalObserver.OnFinished flip to V3 (all impls updated: SafeObserver, ConsoleEvalObserver, DaemonEvalObserver, and three test observers). RunAsync + the daemon's HandlePrepareAsync fetch the catalog and iterate the reconciled questions; HandleRunQuestionAsync judges the reconciled item by id.

New public seams ReconcileQuestions + BuildTextQuestionPrompt are unit-tested (EvalServicePromptVersionTests). The legacy text-path BuildQuestionPrompt is removed and BuildRetrospectivePrompt takes the catalog template + trace. Embedded prompt-eval-question.txt / prompt-eval-retrospective.txt are no longer loaded (resource files left for Task 10 to delete).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
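To make the reconciliation step concrete, here is a minimal sketch of what it describes, not the PR's actual ReconcileQuestions. The record shapes, the `int?` type of the prompt version, the `{QUESTION}` wrapper placeholder, and the choice to return null on an unknown id are all assumptions made for illustration. Only the behaviours named in the commit message (selected order, stripping `{CACHE_BOUNDARY}` on the text path, substituting raw question text into the tools wrapper) come from the source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Demo: reconcile two selected ids against a three-entry catalog.
var catalog = new List<CatalogQuestion>
{
    new("q1", "Judge A {CACHE_BOUNDARY}rest", "Raw A", 3, false),
    new("q2", "Judge B", "Raw B", 1, true),
    new("q3", "Judge C", "Raw C", 7, false),
};
var run = Reconciler.Reconcile(new[] { "q2", "q1" }, catalog, "TOOLS WRAPPER: {QUESTION}");
foreach (var q in run ?? Array.Empty<RunQuestion>())
    Console.WriteLine($"{q.Id} v{q.PromptVersion} tools={q.NeedsTools}: {q.Prompt}");

// Hypothetical shapes; the real EvalCatalogQuestionDto fields may differ.
public record CatalogQuestion(
    string Id, string RenderedPrompt, string QuestionText, int? PromptVersion, bool NeedsTools);

public record RunQuestion(string Id, string Prompt, int? PromptVersion, bool NeedsTools);

public static class Reconciler
{
    // Returns the run list in the order of selectedIds, or null when a
    // selected id is missing from the catalog (assumed fail-closed policy).
    public static IReadOnlyList<RunQuestion>? Reconcile(
        IEnumerable<string> selectedIds,
        IEnumerable<CatalogQuestion> catalog,
        string toolsWrapper)
    {
        // ToDictionary throws on duplicate ids, which also fails loudly.
        var byId = catalog.ToDictionary(q => q.Id, StringComparer.Ordinal);
        var result = new List<RunQuestion>();

        foreach (var id in selectedIds)
        {
            if (!byId.TryGetValue(id, out var q)) return null;

            // Tools path: embedded wrapper + catalog raw text.
            // Text path: server-rendered prompt with the marker stripped.
            var prompt = q.NeedsTools
                ? toolsWrapper.Replace("{QUESTION}", q.QuestionText)
                : q.RenderedPrompt.Replace("{CACHE_BOUNDARY}", "");

            // Carry the version along so each verdict can be stamped later.
            result.Add(new RunQuestion(q.Id, prompt, q.PromptVersion, q.NeedsTools));
        }
        return result;
    }
}
```

The sketch routes on `NeedsTools` alone; the trace-size gate that can also force the tools path is left out to keep the example short.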

qodo-code-review · 2026-06-29T23:14:17Z

Qodo reviews are paused for this user.

linear-code · 2026-06-29T23:14:20Z

AI-9

…y / null item (review)

A JSON `"questions":null` overrides the [] initializer (NRE on .Count), and a `[null]` element NREs on field access. Neither is caught by the HttpRequestException/JsonException handlers, so the run crashed instead of failing closed. Add explicit null guards (return null + OnFailed), plus 2 regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
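The failure mode in this commit message can be reproduced with plain System.Text.Json, so a small sketch may help. The DTO and parser names below are hypothetical stand-ins, not the real EvalCatalogDto or EvalCatalogClient. The sketch also uses reflection-based deserialization for brevity, whereas the PR states it uses source-generated JSON for AOT.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Demo: two hostile payloads are rejected, one well-formed payload passes.
foreach (var json in new[]
{
    "{\"questions\":null}",
    "{\"questions\":[null]}",
    "{\"questions\":[{\"id\":\"q1\"}]}",
})
{
    var dto = CatalogParser.TryParse(json, out var error);
    Console.WriteLine(dto is null
        ? $"rejected: {error}"
        : $"accepted {dto.Questions!.Count} question(s)");
}

// Hypothetical DTOs for illustration only.
public sealed class CatalogDto
{
    // The initializer does not protect against an explicit JSON null:
    // the deserializer assigns null over it.
    [JsonPropertyName("questions")]
    public List<QuestionDto?>? Questions { get; set; } = new();
}

public sealed class QuestionDto
{
    [JsonPropertyName("id")]
    public string? Id { get; set; }
}

public static class CatalogParser
{
    // Returns null with an error message instead of letting a
    // NullReferenceException escape (fail closed).
    public static CatalogDto? TryParse(string json, out string? error)
    {
        CatalogDto? dto;
        try
        {
            dto = JsonSerializer.Deserialize<CatalogDto>(json);
        }
        catch (JsonException ex)
        {
            error = $"malformed catalog JSON: {ex.Message}";
            return null;
        }

        if (dto is null) { error = "catalog body was JSON null"; return null; }
        if (dto.Questions is null) { error = "catalog 'questions' was null"; return null; }

        foreach (var q in dto.Questions)
        {
            if (q is null) { error = "catalog contained a null question"; return null; }
            if (string.IsNullOrEmpty(q.Id)) { error = "catalog question had no id"; return null; }
        }

        error = null;
        return dto;
    }
}
```

In the real client the rejection would presumably also notify the observer (the commit mentions OnFailed); that call is omitted here because its signature is not shown in the source.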

alexeyzimarev · 2026-06-30T10:25:31Z

/agentic_review

qodo-code-review · 2026-06-30T10:29:18Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (1) 📜 Skill insights (0)

1. README missing eval V3 prerequisite 📘 Rule violation ⚙ Maintainability

Description

The eval flow now hard-requires GET /api/eval/catalog and persists results to `POST /api/sessions/{id}/evals/v3`, which changes CLI prerequisites/behavior when run against older servers. README.md’s kcap eval documentation is not updated to reflect this new server requirement, violating the documentation update requirement for user-facing CLI surface changes.

Code

src/Capacitor.Cli.Core/Eval/EvalService.cs[R281-286]

+            // AI-9 Phase 3 — fetch the full catalog (rendered prompts + raw text +
+            // versions) so PrepareAsync can reconcile the run question list from it.
+            var catalog = await EvalCatalogClient.FetchAsync(baseUrl, httpClient, observer, ct);
+            if (catalog is null) return null;   // FetchAsync already emitted OnFailed
+
+            var ctx = await PrepareAsync(baseUrl, httpClient, sessionId, questions, catalog, chain, thresholdBytes, observer, ct, model, evalRunId);

Evidence
The PR introduces a new mandatory runtime dependency for kcap eval: it fetches the eval catalog
from /api/eval/catalog and aborts the run on failure, and it persists results to the V3 endpoint.
README.md’s Session evaluation (LLM-as-judge) section does not mention this new server
prerequisite/compatibility requirement, so the user-facing CLI docs are not updated alongside the
behavior change.
CLAUDE.md: Update README.md in the same PR for any user-facing CLI surface changes
src/Capacitor.Cli.Core/Eval/EvalService.cs[281-286]
src/Capacitor.Cli.Core/Eval/EvalCatalogClient.cs[21-28]
src/Capacitor.Cli.Core/Eval/EvalService.cs[689-705]
README.md[241-257]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`kcap eval` now depends on new server endpoints (`/api/eval/catalog` and `/api/sessions/{id}/evals/v3`) and fails fast if the catalog endpoint is unavailable. This is a user-facing prerequisite change (server compatibility) but README.md was not updated in this PR.

## Issue Context
- The CLI now aborts eval runs when `GET /api/eval/catalog` fails.
- The CLI now persists eval results to the V3 route.
- README.md currently documents `kcap eval` behavior without mentioning the new server requirement/compatibility expectation.

## Fix Focus Areas
- src/Capacitor.Cli.Core/Eval/EvalService.cs[281-286]
- src/Capacitor.Cli.Core/Eval/EvalService.cs[689-705]
- src/Capacitor.Cli.Core/Eval/EvalCatalogClient.cs[21-28]
- README.md[241-257]

2. Retrospective re-embeds full trace 🐞 Bug ☼ Reliability

Description

RunRetrospectiveAsync fills the catalog retrospective template’s {TRACE_JSON} placeholder with
ctx.TraceJson, which can be large enough to overflow the judge model context and undermines the
existing “force tools when trace is big” strategy. For large traces this can make the retrospective
call fail (HTTP 400 / max turns) or become unnecessarily expensive, despite the retrospective
already having MCP tools for on-demand inspection.

Code

src/Capacitor.Cli.Core/Eval/EvalService.cs[R1211-1214]

+        // AI-9 Phase 3: template comes from the catalog and {TRACE_JSON} is
+        // filled with the daemon's already-fetched trace (SF#1).
+        var prompt = BuildRetrospectivePrompt(
+            retrospectivePrompt, sessionMeta, verdictsJson, knownPatterns: "", traceJson);

Evidence
EvalService explicitly documents that embedding {TRACE_JSON} can overflow context and uses a
token-budget gate to route judge calls through the tools path instead of embedding the trace.
However, the retrospective prompt builder now always replaces {TRACE_JSON} with the full
traceJson, reintroducing the large-prompt risk for retrospective synthesis.
src/Capacitor.Cli.Core/Eval/EvalService.cs[190-199]
src/Capacitor.Cli.Core/Eval/EvalService.cs[407-413]
src/Capacitor.Cli.Core/Eval/EvalService.cs[1205-1215]
src/Capacitor.Cli.Core/Eval/EvalService.cs[796-807]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The retrospective prompt is now built by replacing `{TRACE_JSON}` with the full compacted trace, reintroducing the prompt-size risk that `ShouldForceTools` was designed to avoid.

## Issue Context
- The code explicitly documents that embedding `{TRACE_JSON}` can overflow the judge model context window and routes **per-question** judging through the tools path when trace size exceeds a budget.
- The retrospective path also has MCP tools enabled, so it can function without embedding the full trace.

## Fix Focus Areas
- Gate the retrospective trace substitution based on the same size heuristic used for per-question routing (or reuse `ctx.ForceTools`). When the trace is “too large”, replace `{TRACE_JSON}` with an empty string (or a short marker) rather than embedding the full JSON.
- Keep behavior consistent with the intent documented in the surrounding comments.

- src/Capacitor.Cli.Core/Eval/EvalService.cs[190-218]
- src/Capacitor.Cli.Core/Eval/EvalService.cs[1205-1236]
- src/Capacitor.Cli.Core/Eval/EvalService.cs[796-807]
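
The fix suggested above could look roughly like the following sketch. It is an illustration under assumptions, not the change that was merged: the class name, the marker text, and the character-count budget are hypothetical, and the character count only stands in for the token-budget heuristic (ShouldForceTools / ctx.ForceTools) that the review refers to.

```csharp
using System;

// Demo: a small trace is embedded, a large one is replaced by a marker.
var template = "Verdicts follow.\nTRACE:\n{TRACE_JSON}";
Console.WriteLine(RetrospectivePrompt.Fill(template, "{\"steps\":[]}", maxTraceChars: 1000));
Console.WriteLine(RetrospectivePrompt.Fill(template, new string('x', 5000), maxTraceChars: 1000));

public static class RetrospectivePrompt
{
    public const string Placeholder = "{TRACE_JSON}";

    // Hypothetical marker text pointing the judge at its tools instead.
    public const string OmittedMarker =
        "[trace omitted: too large to embed; inspect it with the available tools]";

    // Embeds the trace only when it fits the budget; otherwise substitutes
    // the short marker so the prompt size stays bounded.
    public static string Fill(string template, string traceJson, int maxTraceChars)
    {
        var embed = traceJson.Length <= maxTraceChars;
        return template.Replace(Placeholder, embed ? traceJson : OmittedMarker);
    }
}
```

Reusing the flag already computed for per-question routing, rather than a separate threshold as in this sketch, would keep the two paths consistent, which is what the review recommends.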

kcap eval now fetches its question catalog from GET /api/eval/catalog and posts results to POST /api/sessions/{id}/evals/v3, failing fast against a server that doesn't expose the catalog endpoint. Document this server prerequisite in the Session evaluation section (Qodo review on #200).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

qodo-code-review · 2026-06-30T12:13:24Z

Qodo reviews are paused for this user.

realtonyyoung and others added 7 commits June 29, 2026 17:49

[AI-9] Phase 3: CLI catalog DTOs + per-question prompt_version + V3 payload
f67a94f
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AI-9] Phase 3 Task 2: EvalCatalogClient -- fetch /api/eval/catalog with fail-fast validation
3a4632a
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AI-9] Phase 3 Tasks 6/6b/6c: CLI WireMock integration — V3 post, catalog fetch/reconcile, tools routing
c25fbb2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AI-9] Phase 3 Task 6c: add non-tools negative case (review fix)
2d2653d

[AI-9] Phase 3 Task 10: drop now-dead embedded prompt resources (catalog-provided)
6d4f95d

[AI-9] Phase 3 Task 8: alias raw-text regression guard (CLI WireMock; server alias already raw)
322561f

realtonyyoung commented Jun 29, 2026

Comment thread src/Capacitor.Cli.Core/Eval/EvalCatalogClient.cs Outdated

qodo-code-review Bot reviewed Jun 30, 2026

Comment thread src/Capacitor.Cli.Core/Eval/EvalService.cs

realtonyyoung merged commit 1862530 into main Jun 30, 2026
5 checks passed

realtonyyoung deleted the tonyyoung/ai-9-phase-3-daemon-catalog branch June 30, 2026 14:16

realtonyyoung merged 9 commits into main from tonyyoung/ai-9-phase-3-daemon-catalog

2 participants
