feat(examples): add reproducible Evaluation + Optimization pipeline #99

Open
Adonis-a233 wants to merge 1 commit into trpc-group:main from Adonis-a233:feat/eval-optimize-loop

@Adonis-a233 Adonis-a233 commented Jun 30, 2026

Summary

Implements the reproducible Evaluation + Optimization closed loop for the issue
"Design and implement a reproducible Evaluation + Optimization pipeline".

Resolves #91

What Changed

Adds examples/optimization/eval_optimize_loop/ — a six-stage pipeline around
AgentEvaluator / AgentOptimizer:

  1. Baseline evaluation — train and validation sets scored separately; per-case
    metric sub-scores (final_response / tool_trajectory / rubric), pass/fail,
    failure reasons, and key trace fields.
  2. Failure attribution — rule-based clustering over structured trajectories into
    final_response_mismatch / tool_call_error / parameter_error /
    llm_rubric_not_met / knowledge_recall_insufficient / format_error.
    case_meta.json declares an expected category per case and the report carries an
    attribution accuracy self-check (4/4 = 100% on the bundled sample).
  3. Optimization — fake mode applies a deterministic scripted candidate; live mode
    runs a real GEPA search via AgentOptimizer.optimize + TargetPrompt.add_path.
  4. Candidate validation — full re-run and per-case diff vs baseline
    (new_pass / new_fail / score_up / score_down).
  5. Acceptance gate — five independent configurable checks: validation gain
    threshold, no new hard fail, no key-case regression, no train-up/val-down
    overfit, cost within budget (optimizer spend + token-estimated evaluation spend).
  6. Audit persistence — append-only runs/<timestamp>_<run_id>/ per run with
    prompt snapshots, JSON/Markdown reports, gate reasons, cost/token split,
    duration, GEPA seed, prompt SHA-256, and a full config snapshot; run_id is
    injected into every log line for cross-artifact tracing.
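
The five checks in stage 5 can be sketched as follows. This is a minimal illustration, not the code in this PR: `GateConfig`, `RunSummary`, `acceptance_gate`, and all field names and default thresholds are hypothetical, and a key-case "regression" is assumed to mean any score drop.

```python
from dataclasses import dataclass


@dataclass
class GateConfig:
    # Hypothetical fields and defaults, not the real optimizer.json schema.
    min_val_gain: float = 0.02
    max_cost_usd: float = 1.0


@dataclass
class RunSummary:
    train_score: float
    val_score: float
    hard_fails: frozenset    # IDs of validation cases that hard-failed
    key_case_scores: dict    # key case ID -> score
    cost_usd: float = 0.0    # optimizer spend + estimated evaluation spend


def acceptance_gate(base, cand, cfg):
    """Return (accepted, reasons). Every check runs; none short-circuits."""
    reasons = []
    val_gain = cand.val_score - base.val_score
    train_gain = cand.train_score - base.train_score

    # 1. Validation gain must reach the configured threshold.
    if val_gain < cfg.min_val_gain:
        reasons.append(f"validation gain {val_gain:.3f} < {cfg.min_val_gain:.3f}")
    # 2. No case may hard-fail that did not hard-fail at baseline.
    new_hard = sorted(cand.hard_fails - base.hard_fails)
    if new_hard:
        reasons.append(f"new hard fails: {new_hard}")
    # 3. No key case may score lower than at baseline.
    regressed = sorted(
        k for k, s in base.key_case_scores.items()
        if cand.key_case_scores.get(k, 0.0) < s
    )
    if regressed:
        reasons.append(f"key-case regression: {regressed}")
    # 4. Train up while validation goes down suggests overfitting.
    if train_gain > 0 and val_gain < 0:
        reasons.append("train up / validation down (possible overfit)")
    # 5. Total spend must stay within budget.
    if cand.cost_usd > cfg.max_cost_usd:
        reasons.append(f"cost {cand.cost_usd:.2f} > budget {cfg.max_cost_usd:.2f}")

    return (not reasons, reasons)
```

Collecting every failing reason, rather than stopping at the first, is what lets the audit report list all gate reasons for a rejected candidate.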

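The per-case diff in stage 4 might look roughly like this. The input shape (case ID mapped to a `(passed, score)` pair), the function name `diff_cases`, and the extra `unchanged` bucket are assumptions for illustration; only the four bucket names come from the description above.

```python
def diff_cases(baseline, candidate, eps=1e-9):
    """Classify each shared case; inputs map case_id -> (passed, score)."""
    diff = {
        "new_pass": [],
        "new_fail": [],
        "score_up": [],
        "score_down": [],
        "unchanged": [],  # hypothetical extra bucket for cases with no change
    }
    for case_id in sorted(baseline.keys() & candidate.keys()):
        b_pass, b_score = baseline[case_id]
        c_pass, c_score = candidate[case_id]
        if c_pass and not b_pass:
            diff["new_pass"].append(case_id)
        elif b_pass and not c_pass:
            diff["new_fail"].append(case_id)
        elif c_score > b_score + eps:
            diff["score_up"].append(case_id)
        elif c_score < b_score - eps:
            diff["score_down"].append(case_id)
        else:
            diff["unchanged"].append(case_id)
    return diff
```

Pass/fail flips take priority over score movement here, so a case that flips to failing is reported as `new_fail` rather than `score_down`.
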
Robustness and engineering:

  • Live agent calls retry with exponential backoff + jitter and a per-call timeout
    (EVAL_OPT_CALL_TIMEOUT / EVAL_OPT_CALL_ATTEMPTS / EVAL_OPT_CALL_BACKOFF).
  • optimizer.json is validated at startup (metric weights must sum to 1.0, all
    gate keys present) with readable errors instead of bare KeyErrors.
  • 33 IO-free unit tests under tests/ cover attribution, rubric scoring, every
    gate check, case diffing, the self-check, and config validation.
  • Generated reports are gitignored (running the example never dirties the tree);
    frozen samples are committed under sample_output/ (.json + .md).
  • The 6 sample cases cover all three required situations: optimizable success,
    optimization-ineffective, and post-optimization regression. README includes the
    requested design note (failure attribution, gating, overfit protection, audit).
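
As a rough sketch of the retry behaviour in the first bullet: the environment variable names are taken from this description, but their units, the default values, and the helper name `call_with_retry` are assumptions, not the PR's actual implementation.

```python
import asyncio
import os
import random

# Variable names come from the PR text; semantics and defaults are assumed.
CALL_TIMEOUT = float(os.environ.get("EVAL_OPT_CALL_TIMEOUT", "60"))   # seconds
CALL_ATTEMPTS = int(os.environ.get("EVAL_OPT_CALL_ATTEMPTS", "3"))    # total tries
CALL_BACKOFF = float(os.environ.get("EVAL_OPT_CALL_BACKOFF", "1.0"))  # base delay, seconds


async def call_with_retry(make_call, *, timeout=CALL_TIMEOUT,
                          attempts=CALL_ATTEMPTS, backoff=CALL_BACKOFF):
    """Await make_call() with a per-attempt timeout, retrying on failure."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            # wait_for cancels the call and raises a timeout error on expiry.
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff (base * 2^(n-1)) plus random jitter.
            delay = backoff * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay))
```

`make_call` is a zero-argument callable returning a fresh awaitable, so each retry starts a new call instead of re-awaiting a finished coroutine.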

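The startup validation of optimizer.json could be sketched along these lines. The top-level keys `metric_weights` and `gate`, the gate key names, and the `ConfigError` type are all hypothetical; the real schema lives in the PR's code and may differ.

```python
import math

# Hypothetical gate key names, one per acceptance check.
REQUIRED_GATE_KEYS = {
    "min_val_gain",
    "forbid_new_hard_fail",
    "key_cases",
    "forbid_train_up_val_down",
    "max_cost_usd",
}


class ConfigError(ValueError):
    """Raised with a readable message instead of a bare KeyError."""


def validate_config(cfg):
    weights = cfg.get("metric_weights")
    if not isinstance(weights, dict) or not weights:
        raise ConfigError("'metric_weights' must be a non-empty object")
    total = sum(weights.values())
    # Compare with a tolerance so 0.5 + 0.3 + 0.2 style inputs are accepted.
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ConfigError(f"metric weights must sum to 1.0, got {total:.4f}")

    gate = cfg.get("gate")
    if not isinstance(gate, dict):
        raise ConfigError("'gate' must be an object")
    missing = sorted(REQUIRED_GATE_KEYS - set(gate))
    if missing:
        raise ConfigError(f"'gate' is missing keys: {missing}")
```
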
Validation

Fake mode (no API key, ~1s, deterministic):

python examples/optimization/eval_optimize_loop/run.py --mode fake

github-actions Bot commented Jun 30, 2026

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

codecov Bot commented Jun 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@8080800). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             main         #99   +/-   ##
==========================================
  Coverage        ?   87.51506%           
==========================================
  Files           ?         467           
  Lines           ?       44005           
  Branches        ?           0           
==========================================
  Hits            ?       38511           
  Misses          ?        5494           
  Partials        ?           0           


@Adonis-a233 (Author)

I have read the CLA Document and I hereby sign the CLA

Rook1ex added a commit to trpc-group/cla-database that referenced this pull request Jun 30, 2026
Implements the six-stage Evaluation + Optimization pipeline required by
the issue: baseline evaluation, rule-based failure attribution with an
in-report accuracy self-check, candidate search (scripted in fake mode,
real GEPA via AgentOptimizer.optimize + TargetPrompt in live mode),
candidate validation with per-case deltas, a validation-first
five-check acceptance gate, and append-only audit persistence under
timestamped runs/ directories.

- fake mode is deterministic and needs no API key or network calls
- live agent bridge retries with exponential backoff and per-call
  timeout, and accumulates token usage so evaluation spend is audited
  alongside optimizer spend in the cost gate
- optimizer.json is validated at startup (metric weights, gate keys)
- attribution, rubric, gate, diff, self-check, and config validation
  are covered by 33 IO-free unit tests under tests/
- generated reports are gitignored; frozen JSON/Markdown samples are
  committed under sample_output/
- README ships a design note covering attribution, gating, overfit
  protection, and the audit trail

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Adonis-a233 force-pushed the feat/eval-optimize-loop branch from 781cb4a to d500843 on July 3, 2026 11:17
Adonis-a233 changed the title from "feat: add reproducible Evaluation + Optimization pipeline" to "feat(examples): add reproducible Evaluation + Optimization pipeline" on Jul 3, 2026

Development

Successfully merging this pull request may close these issues.

Build an automated regression and prompt-optimization closed loop for Evaluation + Optimization