
test: rewrite eval runtime span tests to exercise real code#1566

Open
smflorentino wants to merge 7 commits into main from fix/rewrite-eval-runtime-spans-tests

Conversation

Collaborator

@smflorentino smflorentino commented Apr 14, 2026

Summary

  • All ~60 tests in test_eval_runtime_spans.py were tautological — they constructed dicts inline and asserted against those same dicts without calling any production code
  • Replaced with 24 tests that all run the full evaluation pipeline via execute() and assert on the resulting span tree
  • Only execute_runtime (the actual agent invocation) is mocked — initiate_evaluation, _execute_eval, run_evaluator, compute_evaluator_scores, and all span configuration functions run for real
  • Each test class asserts on a different level of the span hierarchy:
    • TestEvalSetRunSpan — batch-level span with aggregate scores, metadata values, schema content, status, and conditional eval_set_run_id
    • TestEvaluationSpan — per-item spans with item attributes, scores, input data, agentId value, and error/success status
    • TestEvaluatorSpan — per-evaluator spans with evaluator attributes
    • TestEvaluationOutputSpan — score output spans with justification extraction (pydantic, string, none) verified in both direct attributes and serialized output JSON
    • TestSpanHierarchy — correct span ordering and counts
  • No production code changes
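The shape of the problem being fixed can be sketched in a few lines. This is an illustrative contrast, not the repository's actual code: `SpanCapturingTracer` and `execute` below are simplified stand-ins, and the span attribute keys are assumptions based on the description above.

```python
# Hypothetical sketch: why the old tests were tautological and how the
# new ones differ. All names here are simplified stand-ins.

class SpanCapturingTracer:
    """Collects spans emitted by the code under test."""
    def __init__(self):
        self.spans = []

    def record(self, span):
        self.spans.append(span)

def execute(tracer):
    # Stand-in for the real pipeline: emits a parent span and one child.
    tracer.record({"span_type": "eval_set_run", "status": "OK"})
    tracer.record({"span_type": "evaluation", "status": "OK"})

def test_tautological():
    # Old style: build a dict inline and assert against that same dict.
    # Passes no matter what production code does.
    span = {"span_type": "eval_set_run"}
    assert span["span_type"] == "eval_set_run"

def test_exercises_real_code():
    # New style: run the pipeline, then assert on what it actually produced.
    tracer = SpanCapturingTracer()
    execute(tracer)
    types = [s["span_type"] for s in tracer.spans]
    assert types == ["eval_set_run", "evaluation"]
```

The tautological version cannot fail even if span construction is deleted outright; the pipeline version fails the moment the real code stops emitting the expected tree.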

Test plan

  • pytest tests/cli/eval/test_eval_runtime_spans.py -v — all 24 tests pass
  • Full test suite (1728 tests) passes with no regressions

🤖 Generated with Claude Code

smflorentino and others added 7 commits April 14, 2026 06:24
All ~60 tests in test_eval_runtime_spans.py were tautological — they
constructed dicts inline and asserted against those same dicts without
ever calling production code (e.g. `assert {"span_type": "eval_set_run"}["span_type"] == "eval_set_run"`).

Replace with 24 tests that inject a SpanCapturingTracer into the
runtime, call the real methods (execute, _execute_eval, run_evaluator),
and assert on the captured spans.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All ~60 tests in test_eval_runtime_spans.py were tautological — they
constructed dicts inline and asserted against those same dicts without
ever calling production code.

Replace with 25 tests that inject a SpanCapturingTracer into the
runtime, call the real methods (execute, _execute_eval, run_evaluator),
and assert on the captured spans.

TestEvalSetRunSpan tests run the full pipeline: execute() flows real
eval items through _execute_eval() and run_evaluator(), then verifies
the parent span has aggregate scores and metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every test now runs the full pipeline: execute() -> initiate_evaluation()
-> _execute_eval() per item -> run_evaluator() per evaluator. Only
execute_runtime (the agent invocation) is mocked.
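Mocking only the agent invocation while everything downstream runs for real might look roughly like the following. This is a hedged sketch: `Runtime` and its method bodies are invented stand-ins, not the repository's implementation; only the patching pattern is the point.

```python
from unittest.mock import patch

# Illustrative stand-in class — not the project's real runtime.
class Runtime:
    def execute_runtime(self, item):
        # The real agent call; the one thing tests should not hit.
        raise RuntimeError("real agent invocation")

    def _execute_eval(self, item):
        # Runs for real in tests; only execute_runtime is replaced.
        output = self.execute_runtime(item)
        return {"item": item, "output": output, "score": 1.0}

    def execute(self, items):
        # Runs for real in tests.
        return [self._execute_eval(i) for i in items]

def test_only_agent_is_mocked():
    runtime = Runtime()
    with patch.object(Runtime, "execute_runtime", return_value="agent-output"):
        results = runtime.execute(["calculator-addition"])
    # execute() and _execute_eval() were exercised; only the agent was faked.
    assert results[0]["output"] == "agent-output"
    assert results[0]["score"] == 1.0
```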

Each test class asserts on a different level of the span tree that the
evaluation produces, rather than testing isolated methods disconnected
from the pipeline they belong to.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace terse abbreviations (Acc, rel, i1, E1) with meaningful names
that convey what is being evaluated (ExactMatchEvaluator,
calculator-addition, sentiment-positive-review, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are mocks, not real evaluator implementations — name them
accordingly. Tests that need one evaluator use the default
MockEvaluator; tests that need multiple use MockEvaluatorA/B/C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Test that the evaluation span gets StatusCode.ERROR when the agent
  returns an error, and StatusCode.OK on success
- Test that the evaluation span carries the eval item's inputs as
  serialized JSON in the input attribute

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Assert agentId value (not just existence) on both eval set run and
  evaluation spans
- Assert inputSchema/outputSchema content matches the runtime schema
- Assert StatusCode.OK on the eval set run span
- Assert evaluatorId and justification in the evaluation output span's
  output JSON (not just as direct span attributes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
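The distinction this commit draws (asserting attribute values rather than mere presence) can be shown in miniature. The dict-based span and the literal values below are illustrative assumptions, not the project's real data.

```python
import json

# Hypothetical span payload; keys follow the PR description
# (agentId, inputSchema), values are made up for illustration.
span = {
    "agentId": "calculator-agent",
    "inputSchema": '{"type": "object", "properties": {"a": {"type": "number"}}}',
}

# Weak: passes even if the pipeline writes the wrong agent id.
assert "agentId" in span

# Strong: pins the exact value the pipeline must produce.
assert span["agentId"] == "calculator-agent"

# Likewise for serialized schema content: parse it and assert on structure,
# not just on the attribute's existence.
schema = json.loads(span["inputSchema"])
assert schema["type"] == "object"
```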