feat(examples): add reproducible Evaluation + Optimization pipeline#99
Open
Adonis-a233 wants to merge 1 commit into
CLA Assistant Lite bot
All contributors have signed the CLA ✍️ ✅
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff            @@
##             main        #99   +/-   ##
==========================================
  Coverage          ?   87.51506%
==========================================
  Files             ?         467
  Lines             ?       44005
  Branches          ?           0
==========================================
  Hits              ?       38511
  Misses            ?        5494
  Partials          ?           0
Author
I have read the CLA Document and I hereby sign the CLA
Rook1ex added a commit to trpc-group/cla-database that referenced this pull request on Jun 30, 2026
Implements the six-stage Evaluation + Optimization pipeline required by the issue: baseline evaluation, rule-based failure attribution with an in-report accuracy self-check, candidate search (scripted in fake mode, real GEPA via AgentOptimizer.optimize + TargetPrompt in live mode), candidate validation with per-case deltas, a validation-first five-check acceptance gate, and append-only audit persistence under timestamped runs/ directories.

- fake mode is deterministic and needs no API key or network calls
- live agent bridge retries with exponential backoff and per-call timeout, and accumulates token usage so evaluation spend is audited alongside optimizer spend in the cost gate
- optimizer.json is validated at startup (metric weights, gate keys)
- attribution, rubric, gate, diff, self-check, and config validation are covered by 33 IO-free unit tests under tests/
- generated reports are gitignored; frozen JSON/Markdown samples are committed under sample_output/
- README ships a design note covering attribution, gating, overfit protection, and the audit trail

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
781cb4a to d500843
Summary
Implements the reproducible Evaluation + Optimization closed loop for the issue
"Design and implement a reproducible Evaluation + Optimization pipeline".
Resolves #91
What Changed
Adds examples/optimization/eval_optimize_loop/, a six-stage pipeline around AgentEvaluator / AgentOptimizer:
- Baseline evaluation: metric sub-scores (final_response / tool_trajectory / rubric), pass/fail, failure reasons, and key trace fields.
- Failure attribution: rule-based, with the categories final_response_mismatch / tool_call_error / parameter_error / llm_rubric_not_met / knowledge_recall_insufficient / format_error. case_meta.json declares an expected category per case, and the report carries an attribution accuracy self-check (4/4 = 100% on the bundled sample).
- Candidate search: scripted in fake mode; live mode runs a real GEPA search via AgentOptimizer.optimize + TargetPrompt.add_path.
- Candidate validation: per-case deltas (new_pass / new_fail / score_up / score_down).
- Acceptance gate: validation-first, with five checks: threshold, no new hard fail, no key-case regression, no train-up/val-down overfit, and cost within budget (optimizer spend + token-estimated evaluation spend).
- Audit persistence: append-only, one runs/<timestamp>_<run_id>/ directory per run with prompt snapshots, JSON/Markdown reports, gate reasons, cost/token split, duration, GEPA seed, prompt SHA-256, and a full config snapshot; run_id is injected into every log line for cross-artifact tracing.
Robustness and engineering:
- Live agent bridge retries with exponential backoff and a per-call timeout (EVAL_OPT_CALL_TIMEOUT / EVAL_OPT_CALL_ATTEMPTS / EVAL_OPT_CALL_BACKOFF).
- optimizer.json is validated at startup (metric weights must sum to 1.0, all gate keys present) with readable errors instead of bare KeyErrors.
- 33 IO-free unit tests under tests/ cover attribution, rubric scoring, every gate check, case diffing, the self-check, and config validation.
- Generated reports are gitignored; frozen samples are committed under sample_output/ (.json + .md).
- [...] optimization-ineffective, and post-optimization regression.
- README includes the requested design note (failure attribution, gating, overfit protection, audit).
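For reviewers who want a picture of the first two robustness items, the sketch below shows one way startup validation and a retrying call wrapper could look. It is an illustration under assumptions, not the code in this PR: the config key names (metric_weights, gate), REQUIRED_GATE_KEYS, ConfigError, retry_settings, call_with_retry, the default values, and the reading of the environment variables as seconds and an attempt count are all hypothetical.

```python
import asyncio
import math

# Hypothetical gate keys; the real optimizer.json schema may differ.
REQUIRED_GATE_KEYS = ("min_val_gain", "max_cost_usd")


class ConfigError(ValueError):
    """Raised with a readable message when the optimizer config is invalid."""


def validate_config(config):
    """Collect every problem, then fail once with a single clear message."""
    problems = []
    weights = config.get("metric_weights")
    if not isinstance(weights, dict) or not weights:
        problems.append("'metric_weights' must be a non-empty object")
    else:
        total = sum(weights.values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            problems.append(f"'metric_weights' must sum to 1.0, got {total:g}")
    gate = config.get("gate")
    if not isinstance(gate, dict):
        problems.append("'gate' must be an object")
    else:
        missing = [key for key in REQUIRED_GATE_KEYS if key not in gate]
        if missing:
            problems.append(f"'gate' is missing keys: {', '.join(missing)}")
    if problems:
        raise ConfigError("invalid optimizer.json: " + "; ".join(problems))


def retry_settings(env):
    """Read retry knobs from an environment mapping (defaults are assumed)."""
    return {
        "timeout_s": float(env.get("EVAL_OPT_CALL_TIMEOUT", "60")),
        "attempts": int(env.get("EVAL_OPT_CALL_ATTEMPTS", "3")),
        "backoff_s": float(env.get("EVAL_OPT_CALL_BACKOFF", "1.0")),
    }


async def call_with_retry(make_call, *, timeout_s, attempts, backoff_s,
                          sleep=asyncio.sleep):
    """Await make_call() with a per-call timeout, doubling the delay per retry."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: backoff_s, 2 * backoff_s, 4 * backoff_s, ...
                await sleep(backoff_s * 2 ** attempt)
    raise RuntimeError(f"call failed after {attempts} attempts") from last_error
```

A caller could combine the two helpers as `await call_with_retry(make_call, **retry_settings(os.environ))` after importing os. Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting.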
Validation
Fake mode (no API key, ~1s, deterministic):