[AI-9] Phase 3: daemon consumes eval catalog + posts V3 results #200
…ayload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ith fail-fast validation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…romptVersion stamping, retro TRACE_JSON)
PrepareAsync now reconciles the run question list from the fetched
EvalCatalogDto (rendered prompt + raw text + prompt_version + needs_tools,
in selected order) instead of loading embedded templates. The text path
uses each question's server-rendered prompt directly (stripping any residual
{CACHE_BOUNDARY}); the tools path keeps the embedded wrapper, substituting
the catalog raw question_text. Aggregate stamps each verdict's PromptVersion
from the reconciled questions and now returns SessionEvalCompletedPayloadV3.
FinalizeAsync builds V3, stamps RetrospectivePromptVersion from
ctx.RetrospectivePromptVersion, fills the retrospective {TRACE_JSON} from the
daemon's already-fetched trace (SF#1), and posts via the new
PersistAggregateV3Async to /evals/v3. RunAsync/FinalizeAsync return V3;
EvalCommand.Render and IEvalObserver.OnFinished flip to V3 (all impls updated:
SafeObserver, ConsoleEvalObserver, DaemonEvalObserver, and three test observers).
RunAsync + the daemon's HandlePrepareAsync fetch the catalog and iterate the
reconciled questions; HandleRunQuestionAsync judges the reconciled item by id.
New public seams ReconcileQuestions + BuildTextQuestionPrompt are unit-tested
(EvalServicePromptVersionTests). The legacy text-path BuildQuestionPrompt is
removed and BuildRetrospectivePrompt takes the catalog template + trace.
Embedded prompt-eval-question.txt / prompt-eval-retrospective.txt are no longer
loaded (resource files left for Task 10 to delete).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
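The reconciliation and prompt-building flow described in the commit above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the C# code from the PR. Everything here is hypothetical: the `CatalogQuestion` shape, the function names, the `{QUESTION}` wrapper placeholder, and the choice to skip ids the catalog does not contain. Only the `{CACHE_BOUNDARY}` token and the field names come from the commit message.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

CACHE_BOUNDARY = "{CACHE_BOUNDARY}"  # token named in the commit message


@dataclass(frozen=True)
class CatalogQuestion:
    """Hypothetical stand-in for one catalog entry."""
    id: str
    rendered_prompt: str           # server-rendered prompt (text path)
    question_text: str             # raw question text (tools path)
    prompt_version: Optional[str]  # nullable, as in the PR
    needs_tools: bool


def reconcile_questions(catalog: List[CatalogQuestion],
                        selected_ids: List[str]) -> List[CatalogQuestion]:
    """Return catalog entries in the order of selected_ids.

    Assumption: ids missing from the catalog are skipped; the real
    behaviour is not stated in the source.
    """
    by_id = {q.id: q for q in catalog}
    return [by_id[i] for i in selected_ids if i in by_id]


def build_text_prompt(q: CatalogQuestion) -> str:
    """Text path: use the rendered prompt, minus any residual boundary token."""
    return q.rendered_prompt.replace(CACHE_BOUNDARY, "")


def build_tools_prompt(q: CatalogQuestion, wrapper: str) -> str:
    """Tools path: substitute the raw question text into an embedded wrapper.

    The {QUESTION} placeholder is invented for this sketch.
    """
    return wrapper.replace("{QUESTION}", q.question_text)


def stamp_versions(verdicts: List[Dict], questions: List[CatalogQuestion]) -> List[Dict]:
    """Copy each verdict and attach the prompt_version of its question."""
    versions = {q.id: q.prompt_version for q in questions}
    return [{**v, "prompt_version": versions.get(v["question_id"])} for v in verdicts]


catalog = [
    CatalogQuestion("q1", "Judge this.{CACHE_BOUNDARY} Was it safe?", "Was it safe?", "v7", False),
    CatalogQuestion("q2", "unused on tools path", "Did tools run?", "v2", True),
]
run = reconcile_questions(catalog, ["q2", "q1", "missing"])
stamped = stamp_versions([{"question_id": "q1", "verdict": "pass"}], run)
```

In this sketch the text path never touches a local template, which mirrors the stated goal that prompt edits arrive through the catalog rather than through a CLI release.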
…alog fetch/reconcile, tools routing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… server alias already raw)
…y / null item (review)
A JSON `"questions":null` overrides the `[]` initializer (NRE on `.Count`), and a `[null]` element NREs on field access. Neither is caught by the `HttpRequestException`/`JsonException` handlers, so the run crashed instead of failing closed. Add explicit null guards (return null + `OnFailed`), plus 2 regression tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
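The fail-closed guard described in this commit can be illustrated with a short sketch. It is written in Python for brevity and is not the C# code from the PR. The function name `parse_catalog`, the `on_failed` callback signature, and the error messages are assumptions. The default of `[]` for a missing `questions` key mirrors the initializer the commit message mentions. How the real client treats an empty list is not stated, so the sketch simply returns it.

```python
import json
from typing import Callable, List, Optional


def parse_catalog(body: str, on_failed: Callable[[str], None]) -> Optional[List[dict]]:
    """Return the question list, or None (after calling on_failed) on bad input."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError as exc:
        on_failed(f"catalog is not valid JSON: {exc.msg}")
        return None

    if not isinstance(doc, dict):
        on_failed("catalog root is not an object")
        return None

    # A missing key falls back to [], but an explicit null overrides that
    # default and must be rejected rather than dereferenced.
    questions = doc.get("questions", [])
    if questions is None:
        on_failed("catalog 'questions' is null")
        return None
    if not isinstance(questions, list):
        on_failed("catalog 'questions' is not an array")
        return None

    # A [null] element would otherwise fail later on field access.
    for index, item in enumerate(questions):
        if item is None:
            on_failed(f"catalog question at index {index} is null")
            return None

    return questions


failures: List[str] = []
null_list = parse_catalog('{"questions":null}', failures.append)
null_item = parse_catalog('{"questions":[null]}', failures.append)
valid = parse_catalog('{"questions":[{"id":"q1"}]}', failures.append)
```

Both malformed payloads take the same exit as a transport or parse error: the caller sees `None` and one recorded failure, so the run can abort cleanly instead of crashing.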
/agentic_review
Code Review by Qodo
1. README missing eval V3 prerequisite
kcap eval now fetches its question catalog from GET /api/eval/catalog and
posts results to POST /api/sessions/{id}/evals/v3, failing fast against a
server that doesn't expose the catalog endpoint. Document this server
prerequisite in the Session evaluation section (Qodo review on #200).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
AI-9 Phase 3 (kcap-cli) — daemon consumes the eval catalog + posts V3 results
The eval daemon no longer ships hardcoded judge prompts. It fetches the server's runtime catalog, reconciles the run against it, and posts versioned (V3) results — so prompt edits land via the admin catalog with no CLI release after this one-time bump.
What's here (7 commits)
- … (`Models.cs`): `EvalCatalogDto`/`EvalCatalogQuestionDto`, additive nullable `prompt_version` on `EvalQuestionDto`/`EvalQuestionVerdict`, `SessionEvalCompletedPayloadV3`, all source-gen registered (AOT-clean).
- `EvalCatalogClient`: fetches `GET /api/eval/catalog` with fail-fast validation (null → run aborts).
- `EvalService` reconciliation + V3 posting: `PrepareAsync` reconciles the run question list from the catalog (rendered prompt + raw text + version + `needs_tools`, in selected order); `Aggregate` stamps each verdict's `prompt_version`; `FinalizeAsync` posts V3 to `/evals/v3`, filling the retrospective `{TRACE_JSON}` from the daemon's trace; `IEvalObserver`/`EvalCommand`/daemon `EvalRunner` flipped to V3.
- … `needs_tools` ids), and an alias raw-text double-wrap guard.

Notes
- `dotnet publish` clean (0 IL2026/IL3050).
- … `prompt-eval-question-tools.txt` (catalog has no tools template; follow-up tracked).

Merge order: this kcap-cli PR merges FIRST; the paired kcap-server PR then re-points its `src/cli` submodule at the merged commit before it merges.

🤖 Generated with Claude Code