feat(examples): add reproducible Evaluation + Optimization pipeline by Adonis-a233 · Pull Request #99 · trpc-group/trpc-agent-python

Adonis-a233 · 2026-06-30T16:02:03Z

Summary

Implements the reproducible Evaluation + Optimization closed loop for the issue
"Design and implement a reproducible Evaluation + Optimization pipeline".

Resolves #91

What Changed

Adds examples/optimization/eval_optimize_loop/, a six-stage pipeline around
AgentEvaluator / AgentOptimizer:

1. Baseline evaluation: train and validation sets are scored separately, with
   per-case metric sub-scores (final_response / tool_trajectory / rubric),
   pass/fail, failure reasons, and key trace fields.
2. Failure attribution: rule-based clustering over structured trajectories into
   final_response_mismatch / tool_call_error / parameter_error /
   llm_rubric_not_met / knowledge_recall_insufficient / format_error.
   case_meta.json declares an expected category per case, and the report carries
   an attribution accuracy self-check (4/4 = 100% on the bundled sample).
3. Optimization: fake mode applies a deterministic scripted candidate; live mode
   runs a real GEPA search via AgentOptimizer.optimize + TargetPrompt.add_path.
4. Candidate validation: full re-run and per-case diff vs baseline
   (new_pass / new_fail / score_up / score_down).
5. Acceptance gate: five independent, configurable checks (validation gain
   threshold, no new hard fail, no key-case regression, no train-up/val-down
   overfit, and cost within budget, counting optimizer spend plus
   token-estimated evaluation spend).
6. Audit persistence: an append-only runs/<timestamp>_<run_id>/ directory per run
   with prompt snapshots, JSON/Markdown reports, gate reasons, cost/token split,
   duration, GEPA seed, prompt SHA-256, and a full config snapshot; run_id is
   injected into every log line for cross-artifact tracing.
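
For reviewers who want a mental model of the candidate validation and acceptance gate stages, here is a minimal sketch of how a per-case diff and five independent checks could fit together. It is not the code in this PR: `CaseResult`, `diff_cases`, `acceptance_gate`, the check names, and the default thresholds are all hypothetical, and "hard fail" is assumed to mean a case that passed on the baseline and fails on the candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    # Hypothetical per-case record; the PR's real report schema may differ.
    case_id: str
    score: float
    passed: bool


def diff_cases(baseline, candidate):
    """Label each candidate case relative to the baseline run."""
    base = {c.case_id: c for c in baseline}
    labels = {}
    for cand in candidate:
        old = base[cand.case_id]
        if cand.passed and not old.passed:
            labels[cand.case_id] = "new_pass"
        elif old.passed and not cand.passed:
            labels[cand.case_id] = "new_fail"
        elif cand.score > old.score:
            labels[cand.case_id] = "score_up"
        elif cand.score < old.score:
            labels[cand.case_id] = "score_down"
        else:
            labels[cand.case_id] = "unchanged"
    return labels


def mean_score(results):
    return sum(r.score for r in results) / len(results)


def acceptance_gate(train_base, train_cand, val_base, val_cand, *,
                    key_cases, cost, min_val_gain=0.05, budget=1.0):
    """Evaluate five independent checks; return (accepted, result per check)."""
    val_diff = diff_cases(val_base, val_cand)
    train_gain = mean_score(train_cand) - mean_score(train_base)
    val_gain = mean_score(val_cand) - mean_score(val_base)
    regressed = {"new_fail", "score_down"}
    checks = {
        "validation_gain": val_gain >= min_val_gain,
        "no_new_hard_fail": "new_fail" not in val_diff.values(),
        "no_key_case_regression": all(
            val_diff.get(case_id) not in regressed for case_id in key_cases
        ),
        # Overfit signal (assumed definition): train improves while validation drops.
        "no_overfit": not (train_gain > 0 and val_gain < 0),
        "cost_within_budget": cost <= budget,
    }
    return all(checks.values()), checks
```

Keeping each check as a separate boolean means a rejected candidate can be reported with the specific checks that failed rather than a single verdict, which is the kind of detail the "gate reasons" audit field would need.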

Robustness and engineering:

- Live agent calls retry with exponential backoff + jitter and a per-call timeout
  (EVAL_OPT_CALL_TIMEOUT / EVAL_OPT_CALL_ATTEMPTS / EVAL_OPT_CALL_BACKOFF).
- optimizer.json is validated at startup (metric weights must sum to 1.0, all
  gate keys present) with readable errors instead of bare KeyErrors.
- 33 IO-free unit tests under tests/ cover attribution, rubric scoring, every
  gate check, case diffing, the self-check, and config validation.
- Generated reports are gitignored (running the example never dirties the tree);
  frozen samples are committed under sample_output/ (.json + .md).
- The 6 sample cases cover all three required situations: optimizable success,
  optimization-ineffective, and post-optimization regression.
- README includes the requested design note (failure attribution, gating,
  overfit protection, audit).
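
As an illustration of the retry behaviour described above, here is a minimal asyncio sketch of exponential backoff with jitter plus a per-call timeout. The three environment variable names come from this PR; the defaults, the units (seconds), the exact jitter formula, and the helper names `backoff_delay` and `call_with_retry` are assumptions made for the example and may not match the actual bridge code.

```python
import asyncio
import os
import random

# Variable names are from the PR; defaults and units (seconds) are assumed.
CALL_TIMEOUT = float(os.environ.get("EVAL_OPT_CALL_TIMEOUT", "30"))
CALL_ATTEMPTS = int(os.environ.get("EVAL_OPT_CALL_ATTEMPTS", "3"))
CALL_BACKOFF = float(os.environ.get("EVAL_OPT_CALL_BACKOFF", "1.0"))


def backoff_delay(attempt, base=CALL_BACKOFF, rng=random.random):
    """Delay after failed attempt `attempt` (0-based).

    Exponential term base * 2**attempt plus additive jitter in [0, base).
    """
    return base * (2 ** attempt) + rng() * base


async def call_with_retry(make_call, *, attempts=CALL_ATTEMPTS,
                          timeout=CALL_TIMEOUT, base=CALL_BACKOFF,
                          sleep=asyncio.sleep, rng=random.random):
    """Await make_call() with a per-call timeout, retrying on failure.

    `make_call` is a zero-argument callable returning a fresh awaitable,
    so each attempt starts a new call. The last error is re-raised once
    all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except Exception as exc:  # also covers the timeout raised by wait_for
            last_error = exc
            if attempt < attempts - 1:
                await sleep(backoff_delay(attempt, base, rng))
    raise last_error
```

Injecting `sleep` and `rng` keeps the helper testable without real waiting or randomness, in the same spirit as the IO-free unit tests mentioned above.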

Validation

Fake mode (no API key, ~1s, deterministic):

python examples/optimization/eval_optimize_loop/run.py --mode fake

github-actions · 2026-06-30T16:02:15Z

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

codecov · 2026-06-30T16:05:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@8080800).

Additional details and impacted files

@@            Coverage Diff             @@
##             main         #99   +/-   ##
==========================================
  Coverage        ?   87.51506%           
==========================================
  Files           ?         467           
  Lines           ?       44005           
  Branches        ?           0           
==========================================
  Hits            ?       38511           
  Misses          ?        5494           
  Partials        ?           0

Adonis-a233 · 2026-06-30T16:10:25Z

I have read the CLA Document and I hereby sign the CLA

Implements the six-stage Evaluation + Optimization pipeline required by the issue: baseline evaluation, rule-based failure attribution with an in-report accuracy self-check, candidate search (scripted in fake mode, real GEPA via AgentOptimizer.optimize + TargetPrompt in live mode), candidate validation with per-case deltas, a validation-first five-check acceptance gate, and append-only audit persistence under timestamped runs/ directories.

- fake mode is deterministic and needs no API key or network calls
- live agent bridge retries with exponential backoff and per-call timeout, and accumulates token usage so evaluation spend is audited alongside optimizer spend in the cost gate
- optimizer.json is validated at startup (metric weights, gate keys)
- attribution, rubric, gate, diff, self-check, and config validation are covered by 33 IO-free unit tests under tests/
- generated reports are gitignored; frozen JSON/Markdown samples are committed under sample_output/
- README ships a design note covering attribution, gating, overfit protection, and the audit trail

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Rook1ex added a commit to trpc-group/cla-database that referenced this pull request Jun 30, 2026

@Adonis-a233 has signed the CLA in trpc-group/trpc-agent-python#99

a39731c

Adonis-a233 force-pushed the feat/eval-optimize-loop branch from 781cb4a to d500843 Compare July 3, 2026 11:17

Adonis-a233 changed the title ~~feat: add reproducible Evaluation + Optimization pipeline~~ feat(examples): add reproducible Evaluation + Optimization pipeline Jul 3, 2026

Adonis-a233 wants to merge 1 commit into trpc-group:main from Adonis-a233:feat/eval-optimize-loop
