Add confidence intervals to main code #130
Open
lindsaydbrin wants to merge 10 commits into
New pure-Python module providing percentile bootstrap CI primitives: bootstrap_resample, bootstrap_ci, assign_bootstrap_cis helper, plus a SHA-stable run_seed for cross-process-deterministic per-run seeding. Constants N_BOOT=2000, ALPHA=0.05, BASE_SEED=42. No eva imports — safe to use from anywhere in the package. 13 unit tests cover the primitives plus a cross-process determinism check that guards against accidental use of Python's salted hash(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
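The percentile bootstrap described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the signature, the nearest-rank index choice, and the (None, None) return for empty input are all assumptions; only the names bootstrap_ci, N_BOOT, and ALPHA and their values come from the commit message.

```python
import random
import statistics

N_BOOT = 2000  # resample count named in the commit message
ALPHA = 0.05   # two-sided level, i.e. a 95% interval


def bootstrap_ci(values, *, seed, n_boot=N_BOOT, alpha=ALPHA):
    """Return (lower, upper) percentile-bootstrap bounds on the mean.

    Assumed convention: empty input yields (None, None).
    """
    if not values:
        return None, None
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Approximate nearest-rank percentiles of the sorted bootstrap means.
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return means[lo_idx], means[hi_idx]
```

Calling it twice with the same seed returns identical bounds, which is the property the determinism tests in this commit are described as guarding.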
Adds 95% percentile-bootstrap CIs to every CI-bearing composite scalar in the aggregation layer:
- compute_run_level_aggregates emits mean_ci_lower/upper/n_scenarios on every composite entry (pass/derived composites get the CI on mean; success_rate stays bare).
- _compute_aggregate_pass_k emits stat_ci_lower/upper for pass_at_1, pass_at_k, and pass_power_k_observed (theoretical stays bare as a deterministic transform).
- Both functions accept a seed kwarg threaded by the runner in a follow-up commit.
Bootstrap unit is the scenario, not the trial: two new private helpers (_scenario_means_for_metric, _scenario_values_for_composite) collapse multi-trial records to one value per scenario before resampling. For k=1 runs each record is its own scenario.
Adds 15 unit tests across TestScenarioGrouping, TestRunLevelCompositeCIs, and TestRunLevelPassKCIs covering field shape, point-estimate bracketing, seed determinism, and null-CI handling for empty-data composites.
Bumps metrics_version 2.0.0 -> 2.1.0 (additive schema change).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
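Collapsing trials to one value per scenario before resampling might look like the sketch below. The (scenario_id, value) input shape, the function name scenario_means, and the choice to skip None values are assumptions for illustration; the PR's actual helpers and record layout are not shown in this conversation.

```python
from collections import defaultdict
from statistics import fmean


def scenario_means(per_record_values):
    """Collapse (scenario_id, value) pairs to one mean per scenario.

    Hypothetical input shape: an iterable of (scenario_id, value) tuples,
    one per trial record. None values are skipped (an assumption).
    """
    grouped = defaultdict(list)
    for scenario_id, value in per_record_values:
        if value is not None:
            grouped[scenario_id].append(value)
    # One number per scenario: this is the unit that gets resampled.
    return {sid: fmean(vals) for sid, vals in grouped.items()}


# Three trials of scenario "a" and one of "b" become two bootstrap units.
records = [("a", 1.0), ("a", 0.0), ("a", 1.0), ("b", 0.5)]
units = list(scenario_means(records).values())
```

Resampling scenarios rather than trials keeps repeated trials of the same scenario from being treated as independent observations.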
Extends MetricsRunner to populate the per-metric half of the CI schema and wire the run-dependent seed end-to-end:
- _build_per_metric_aggregates emits mean_ci_lower/upper/n_scenarios on every per-metric entry and stat_ci_lower/upper inside pass_k sub-blocks (pass_at_1, pass_at_k, pass_power_k_observed).
- _save_summary and run_aggregate_only compute seed = run_seed(run_dir.name) once and thread it through both aggregators, so re-running aggregate-only on the same run yields byte-identical CIs and different runs get independent Monte-Carlo noise.
Per-metric aggregates reuse aggregation._scenario_means_for_metric to collapse trials before bootstrapping; pass_k blocks share the assign_bootstrap_cis helper with the composite path.
Adds 6 unit tests across TestPerMetricCIs and TestRunSeedIntegration covering per-metric field shape, null-CI handling, same-seed byte-identity, and across-run independence.
(metrics_version bumped to 2.1.0 in the preceding commit; --no-verify used to skip the per-commit version-bump reminder.)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
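A run-name-derived seed that is stable across processes could be sketched as below. The PR description says the seed comes from run_dir.name via SHA-256; taking the first 8 bytes of the digest as a big-endian integer is an assumption made here for illustration, as is the example run name.

```python
import hashlib


def run_seed(run_name: str) -> int:
    """Map a run directory name to a deterministic integer seed.

    Uses SHA-256 rather than the built-in hash(), whose output for str
    is salted per interpreter process unless PYTHONHASHSEED is fixed.
    """
    digest = hashlib.sha256(run_name.encode("utf-8")).digest()
    # Assumed truncation: first 8 bytes give a 64-bit non-negative integer.
    return int.from_bytes(digest[:8], "big")


# Hypothetical run name; the same name always maps to the same seed.
seed = run_seed("example_run")
```

Because the digest depends only on the name's bytes, re-running aggregation in a fresh process reproduces the same seed and therefore the same bootstrap draws.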
- Extract assign_mean_ci helper to utils.bootstrap, mirroring assign_bootstrap_cis. Apply at both mean-CI sites (compute_run_level_aggregates and _build_per_metric_aggregates): each call site shrinks from 11 lines to 2. bootstrap_ci is no longer imported in aggregation.py or runner.py.
- Drop the leading underscore from scenario_means_for_metric since it is the cross-module surface imported by runner.py. _scenario_values_for_composite stays underscored (internal only).
- Hoist _make_clean_records to module level in test_aggregation.py; the identical helper was duplicated in TestRunLevelCompositeCIs and TestRunSeedIntegration.
No behavior or output-schema change. 424/424 tests pass.
(--no-verify skips the per-commit version-bump reminder; version was bumped 2.0.0 -> 2.1.0 in d34ae59 and is unchanged here.)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces in-place mutating helpers with pure functions that return
the CI fields, leaving the merge to the caller:
- assign_bootstrap_cis(target, samples, *, seed) -> None
becomes
bootstrap_ci_fields(samples, *, seed) -> dict
- assign_mean_ci(target, scenario_values, *, seed) -> None
becomes
mean_ci_fields(scenario_values, *, seed) -> dict
Call sites now read top-to-bottom in data-flow order:
entry.update(mean_ci_fields(scenario_values, seed=seed))
entry.update(bootstrap_ci_fields({...}, seed=seed))
instead of relying on a hidden side-effect from the helper. The
helpers are also trivially testable in isolation without
constructing a target dict.
No behavior change. 424/424 tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
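The pure-function shape described in this commit can be illustrated with the sketch below. The field names follow the PR description (mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios); the body, the default parameters, and the null handling for empty input are assumptions rather than the PR's code.

```python
import random
from statistics import fmean


def mean_ci_fields(scenario_values, *, seed, n_boot=2000, alpha=0.05):
    """Return the CI fields as a new dict; the caller decides where to merge."""
    n = len(scenario_values)
    if n == 0:
        # Assumed null-CI convention for empty data.
        return {"mean_ci_lower": None, "mean_ci_upper": None,
                "mean_ci_n_scenarios": 0}
    rng = random.Random(seed)
    means = sorted(
        fmean(rng.choices(scenario_values, k=n)) for _ in range(n_boot)
    )
    return {
        "mean_ci_lower": means[int((alpha / 2) * n_boot)],
        "mean_ci_upper": means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))],
        "mean_ci_n_scenarios": n,
    }


# Call site reads in data-flow order: build the entry, then merge the fields.
values = [0.0, 1.0, 1.0, 0.5]
entry = {"mean": fmean(values)}
entry.update(mean_ci_fields(values, seed=7))
```

Since the helper touches nothing but its return value, a test can assert on the returned dict directly without first building a target to mutate.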
TestRunLevelCompositeCIs.test_within_run_determinism and test_different_seeds_differ were strictly subsumed by TestRunSeedIntegration.test_within_run_byte_identical and test_across_run_independence respectively — the integration versions exercise the same code paths through run_seed() and also check full-dict equality / point-estimate stability that the simpler versions skipped. Removing them also drops an awkward seed=13 empirical-choice hack (n=20 bimodal fixture is too stable for arbitrary seed pairs) that was only needed in the subsumed test. Net -19 lines, -1 awkward seed hack, no coverage loss. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The docstring was the only outstanding ruff failure on CI; D205 isn't auto-fixable by ruff so a manual blank line is needed between the summary line and the body. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BASE_SEED used to play two roles: a component of run_seed (already removed in commit f6f3fae when run_seed was simplified to ignore base_seed), and a default value for bootstrap_ci / the three aggregator entry points. The second role was a silent trap: production always supplies seed = run_seed(run_dir.name), so the default was never exercised in production but would have produced correct-looking-but-wrong (because cross-run-correlated) CIs if a caller forgot to pass one.
Changes:
- Remove BASE_SEED constant from src/eva/utils/bootstrap.py.
- Make seed a keyword-only required argument on bootstrap_ci, compute_run_level_aggregates, _compute_aggregate_pass_k, and _build_per_metric_aggregates.
- Add explicit seed=42 to the 10 pre-existing test callers that weren't passing a seed (their tests don't check CI behavior; this is purely satisfying the now-required signature).
- Rename test_defaults_match_module_constants -> test_n_boot_and_alpha_defaults_match_module_constants and drop the BASE_SEED check.
Production behavior unchanged. 422/422 tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
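The effect of making seed keyword-only and required can be shown with a small stand-in function. draw_resample is a hypothetical name used only for this illustration; it is not one of the functions changed in this commit.

```python
import random


def draw_resample(values, *, seed):
    """Draw one with-replacement resample; seed has no default on purpose."""
    rng = random.Random(seed)
    return rng.choices(values, k=len(values))


# Forgetting the seed now fails loudly instead of silently using a default.
try:
    draw_resample([1, 2, 3])
    missing_seed_rejected = False
except TypeError:
    missing_seed_rejected = True

# Passing it positionally is rejected too, because of the bare * marker.
try:
    draw_resample([1, 2, 3], 42)
    positional_seed_rejected = False
except TypeError:
    positional_seed_rejected = True
```

With no default, a caller that omits the seed gets a TypeError at the call site rather than intervals that share Monte-Carlo noise across runs.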
Adds three words to the seed kwarg docstrings so the rationale for the run_seed(run_dir.name) recommendation is visible at the call site: CIs need to be stable across re-computation of the same run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lindsaydbrin
commented
Jun 1, 2026
  # Bump metrics_version when changes affect metric computation (metrics code,
  # judge prompts, pricing tables, postprocessor).
- metrics_version = "2.0.0"
+ metrics_version = "2.1.0"
Collaborator
Author
Is this correct since it changes output fields?
…s_for_composite
Both functions differed only in the value extractor; factor the shared grouping + averaging into a private _scenario_means(per_record_values) helper. Each public function becomes a one-line wrapper that builds the dict via comprehension. No new imports (no Callable needed).
Also clean up two docstring touches in bootstrap_ci:
- "95% percentile bootstrap CI" -> "95% bootstrap CI" (the percentile method is established at module level; "percentile" alongside "95%" reads as redundant).
- "so behavior is defined" -> "so behavior is deterministic" (more precise about what the seed gives you).
422/422 tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-generated summary to be revised later:
Every eva metrics run now emits 95% bootstrap confidence intervals on every CI-bearing scalar in metrics_summary.json. CIs are deterministic within a run (same run_dir.name → byte-identical bounds) and independent across runs.
Scope (purely additive, no existing field renamed or changed):
- overall_scores.{COMPOSITE}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios}: for all 6 composites.
- overall_scores.pass_k.{COMPOSITE}.{stat}_ci_{lower,upper}: for pass_at_1, pass_at_k, pass_power_k_observed (multi-trial only). pass_power_k_theoretical stays bare.
- per_metric.{name}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios}.
- per_metric.{name}.pass_k.{stat}_ci_{lower,upper} (multi-trial only).
Statistical design (matches the paper-validated approach in analysis/eva-bench-stats/): percentile bootstrap on per-scenario sample means, N_BOOT=2000, ALPHA=0.05, seed derived from run_dir.name via SHA-256 (cross-process-stable).
Commits:
- f6f3fae3: Add src/eva/utils/bootstrap.py primitives module + 13 unit tests.
- d34ae593: Emit composite CIs in aggregation.py + 15 tests; bump metrics_version 2.0.0 → 2.1.0.
- 45a7b009: Emit per-metric CIs in runner.py; thread run_seed(run_dir.name) through _save_summary and run_aggregate_only; 6 tests.