Add confidence intervals to main code #130

Open
lindsaydbrin wants to merge 10 commits into main from pr/lindsay/add_CIs

Conversation

@lindsaydbrin
Collaborator

Claude-generated summary to be revised later:

Every eva metrics run now emits 95% bootstrap confidence intervals on every CI-bearing scalar in metrics_summary.json. CIs are deterministic within a run (same run_dir.name → byte-identical bounds) and independent across runs.

Scope (purely additive — no existing field renamed or changed):

  • overall_scores.{COMPOSITE}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios} — for all 6 composites.
  • overall_scores.pass_k.{COMPOSITE}.{stat}_ci_{lower,upper} — for pass_at_1, pass_at_k, pass_power_k_observed (multi-trial only). pass_power_k_theoretical stays bare.
  • per_metric.{name}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios}.
  • per_metric.{name}.pass_k.{stat}_ci_{lower,upper} (multi-trial only).

Statistical design (matches the paper-validated approach in analysis/eva-bench-stats/): percentile bootstrap on per-scenario sample means, N_BOOT=2000, ALPHA=0.05, seed derived from run_dir.name via SHA-256 (cross-process-stable).
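
For readers unfamiliar with the method, here is a minimal sketch of a percentile bootstrap on per-scenario means. It is not the code in src/eva/utils/bootstrap.py: the function name `percentile_bootstrap_ci`, the nearest-rank percentile rule, and the `(None, None)` return for empty input are assumptions made for illustration. Only N_BOOT=2000 and ALPHA=0.05 come from this PR.

```python
import random
import statistics

N_BOOT = 2000  # number of bootstrap resamples, as stated in the PR
ALPHA = 0.05   # 1 - confidence level, giving a 95% interval


def percentile_bootstrap_ci(scenario_means, *, seed, n_boot=N_BOOT, alpha=ALPHA):
    """Return (lower, upper) percentile-bootstrap bounds for the mean.

    Hypothetical sketch; not the actual eva.utils.bootstrap API.
    """
    if not scenario_means:
        return None, None  # assumed null-CI convention for empty data
    rng = random.Random(seed)  # seeded generator, so results are repeatable
    n = len(scenario_means)
    # Resample scenarios with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(scenario_means, k=n)) for _ in range(n_boot)
    )
    # Nearest-rank percentiles; a real implementation may interpolate instead.
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return boot_means[lo_idx], boot_means[hi_idx]


values = [0.0, 1.0] * 10  # 20 illustrative per-scenario means
lower, upper = percentile_bootstrap_ci(values, seed=1)
```

Calling the function twice with the same seed returns the same bounds, which is the property the within-run determinism claim relies on.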

Commits:

  1. f6f3fae3 — Add src/eva/utils/bootstrap.py primitives module + 13 unit tests.
  2. d34ae593 — Emit composite CIs in aggregation.py + 15 tests; bump metrics_version 2.0.0 → 2.1.0.
  3. 45a7b009 — Emit per-metric CIs in runner.py; thread run_seed(run_dir.name) through _save_summary and run_aggregate_only; 6 tests.

lindsaydbrin and others added 9 commits May 29, 2026 11:59
New pure-Python module providing percentile bootstrap CI primitives:
bootstrap_resample, bootstrap_ci, assign_bootstrap_cis helper, plus a
SHA-stable run_seed for cross-process-deterministic per-run seeding.
Constants N_BOOT=2000, ALPHA=0.05, BASE_SEED=42. No eva imports — safe
to use from anywhere in the package.

13 unit tests cover the primitives plus a cross-process determinism
check that guards against accidental use of Python's salted hash().
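
As a rough illustration of why SHA-256 is used here instead of hash(): Python salts str hashes per process unless PYTHONHASHSEED is set, so hash(name) can differ between two runs of the same script. The sketch below derives a seed from a digest instead. The name `sketch_run_seed` and the choice to keep the first 8 bytes are assumptions; the real run_seed may truncate or encode differently.

```python
import hashlib


def sketch_run_seed(run_name: str) -> int:
    """Map a run directory name to a process-independent integer seed.

    Hypothetical stand-in for run_seed; the 8-byte truncation is an assumption.
    """
    digest = hashlib.sha256(run_name.encode("utf-8")).digest()
    # SHA-256 output depends only on the input bytes, never on the process.
    return int.from_bytes(digest[:8], "big")


seed_a = sketch_run_seed("run_a")  # illustrative run names
seed_b = sketch_run_seed("run_b")
```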

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 95% percentile-bootstrap CIs to every CI-bearing composite scalar
in the aggregation layer:

- compute_run_level_aggregates emits mean_ci_lower/upper/n_scenarios on
  every composite entry (pass/derived composites get the CI on mean;
  success_rate stays bare).
- _compute_aggregate_pass_k emits stat_ci_lower/upper for pass_at_1,
  pass_at_k, and pass_power_k_observed (theoretical stays bare as a
  deterministic transform).
- Both functions accept a seed kwarg threaded by the runner in a
  follow-up commit.

Bootstrap unit is the scenario, not the trial: two new private helpers
(_scenario_means_for_metric, _scenario_values_for_composite) collapse
multi-trial records to one value per scenario before resampling. For
k=1 runs each record is its own scenario.
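
The collapse step can be sketched as follows. The record shape, a `(scenario_id, value)` pair, and the function name `collapse_to_scenario_means` are assumptions for illustration; the real helpers operate on eva's own record objects.

```python
from collections import defaultdict
from statistics import fmean


def collapse_to_scenario_means(records):
    """Collapse (scenario_id, value) trial records to one mean per scenario."""
    by_scenario = defaultdict(list)
    for scenario_id, value in records:
        by_scenario[scenario_id].append(value)
    # One value per scenario, so trials of the same scenario are never
    # resampled as if they were independent observations.
    return {sid: fmean(vals) for sid, vals in by_scenario.items()}


# Three trials (k=3) for each of two scenarios.
multi_trial = [("s1", 1.0), ("s1", 0.0), ("s1", 1.0),
               ("s2", 0.0), ("s2", 0.0), ("s2", 0.0)]
means = collapse_to_scenario_means(multi_trial)

# With k=1, each record is already its own scenario.
single_trial = collapse_to_scenario_means([("a", 1.0), ("b", 0.0)])
```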

Adds 15 unit tests across TestScenarioGrouping, TestRunLevelCompositeCIs,
and TestRunLevelPassKCIs covering field shape, point-estimate bracketing,
seed determinism, and null-CI handling for empty-data composites.

Bumps metrics_version 2.0.0 -> 2.1.0 (additive schema change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends MetricsRunner to populate the per-metric half of the CI schema
and wire the run-dependent seed end-to-end:

- _build_per_metric_aggregates emits mean_ci_lower/upper/n_scenarios
  on every per-metric entry and stat_ci_lower/upper inside pass_k
  sub-blocks (pass_at_1, pass_at_k, pass_power_k_observed).
- _save_summary and run_aggregate_only compute seed = run_seed(run_dir.name)
  once and thread it through both aggregators, so re-running aggregate-only
  on the same run yields byte-identical CIs and different runs get
  independent Monte-Carlo noise.

Per-metric aggregates reuse aggregation._scenario_means_for_metric to
collapse trials before bootstrapping; pass_k blocks share the
assign_bootstrap_cis helper with the composite path.

Adds 6 unit tests across TestPerMetricCIs and TestRunSeedIntegration
covering per-metric field shape, null-CI handling, same-seed
byte-identity, and across-run independence.

(metrics_version bumped to 2.1.0 in the preceding commit; --no-verify
used to skip the per-commit version-bump reminder.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract assign_mean_ci helper to utils.bootstrap, mirroring
  assign_bootstrap_cis. Apply at both mean-CI sites
  (compute_run_level_aggregates and _build_per_metric_aggregates):
  each call site shrinks from 11 lines to 2. bootstrap_ci is no
  longer imported in aggregation.py or runner.py.
- Drop the leading underscore from scenario_means_for_metric since
  it is the cross-module surface imported by runner.py.
  _scenario_values_for_composite stays underscored (internal only).
- Hoist _make_clean_records to module level in test_aggregation.py;
  the identical helper was duplicated in TestRunLevelCompositeCIs
  and TestRunSeedIntegration.

No behavior or output-schema change. 424/424 tests pass.
(--no-verify skips the per-commit version-bump reminder; version
was bumped 2.0.0 -> 2.1.0 in d34ae59 and is unchanged here.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces in-place mutating helpers with pure functions that return
the CI fields, leaving the merge to the caller:

- assign_bootstrap_cis(target, samples, *, seed) -> None
  becomes
  bootstrap_ci_fields(samples, *, seed) -> dict
- assign_mean_ci(target, scenario_values, *, seed) -> None
  becomes
  mean_ci_fields(scenario_values, *, seed) -> dict

Call sites now read top-to-bottom in data-flow order:

  entry.update(mean_ci_fields(scenario_values, seed=seed))
  entry.update(bootstrap_ci_fields({...}, seed=seed))

instead of relying on a hidden side-effect from the helper. The
helpers are also trivially testable in isolation without
constructing a target dict.
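
A minimal sketch of the pure-function shape, assuming a simple nearest-rank bootstrap underneath. The body of `mean_ci_fields` below and the private `_bootstrap_bounds` helper are illustrative only and are not the code in utils.bootstrap; empty-input handling is omitted for brevity.

```python
import random
from statistics import fmean


def _bootstrap_bounds(values, *, seed, n_boot=2000, alpha=0.05):
    """Hypothetical helper: nearest-rank percentile bounds for the mean."""
    rng = random.Random(seed)
    means = sorted(fmean(rng.choices(values, k=len(values))) for _ in range(n_boot))
    return means[int(alpha / 2 * (n_boot - 1))], means[int((1 - alpha / 2) * (n_boot - 1))]


def mean_ci_fields(scenario_values, *, seed):
    """Return the CI fields as a new dict instead of mutating a target."""
    lower, upper = _bootstrap_bounds(scenario_values, seed=seed)
    return {
        "mean_ci_lower": lower,
        "mean_ci_upper": upper,
        "mean_ci_n_scenarios": len(scenario_values),
    }


# The caller owns the merge, so the data flow is visible at the call site.
entry = {"mean": 0.5}
entry.update(mean_ci_fields([0.0, 1.0, 1.0, 0.0], seed=7))
```

Because the helper returns a plain dict, a test can compare two calls directly without building a target entry first.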

No behavior change. 424/424 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TestRunLevelCompositeCIs.test_within_run_determinism and
test_different_seeds_differ were strictly subsumed by
TestRunSeedIntegration.test_within_run_byte_identical and
test_across_run_independence respectively — the integration
versions exercise the same code paths through run_seed() and
also check full-dict equality / point-estimate stability that
the simpler versions skipped.

Removing them also drops an awkward seed=13 empirical-choice
hack (n=20 bimodal fixture is too stable for arbitrary seed
pairs) that was only needed in the subsumed test.

Net -19 lines, -1 awkward seed hack, no coverage loss.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The docstring was the only outstanding ruff failure on CI; D205
isn't auto-fixable by ruff so a manual blank line is needed
between the summary line and the body.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BASE_SEED used to play two roles: a component of run_seed (already
removed in commit f6f3fae when run_seed was simplified to ignore
base_seed), and a default value for bootstrap_ci / the three
aggregator entry points. The second role was a silent trap:
production always supplies seed = run_seed(run_dir.name), so the
default was never exercised in production but would have produced
correct-looking-but-wrong (because cross-run-correlated) CIs if a
caller forgot to pass one.

Changes:
- Remove BASE_SEED constant from src/eva/utils/bootstrap.py.
- Make seed a keyword-only required argument on bootstrap_ci,
  compute_run_level_aggregates, _compute_aggregate_pass_k, and
  _build_per_metric_aggregates.
- Add explicit seed=42 to the 10 pre-existing test callers that
  weren't passing a seed (their tests don't check CI behavior;
  this is purely satisfying the now-required signature).
- Rename test_defaults_match_module_constants ->
  test_n_boot_and_alpha_defaults_match_module_constants and
  drop the BASE_SEED check.
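
To show what the required keyword-only seed buys, here is a small sketch. `bootstrap_mean` is a hypothetical function, not part of eva; it only demonstrates that a signature with `*, seed` and no default turns a forgotten seed into an immediate TypeError rather than a silently shared default.

```python
import random
from statistics import fmean


def bootstrap_mean(values, *, seed):
    """One bootstrap replicate of the mean; `seed` deliberately has no default."""
    rng = random.Random(seed)
    return fmean(rng.choices(values, k=len(values)))


try:
    bootstrap_mean([0.0, 1.0])  # caller forgot the seed
except TypeError as exc:
    missing_seed_error = str(exc)

try:
    bootstrap_mean([0.0, 1.0], 42)  # positional seed is rejected too
except TypeError as exc:
    positional_seed_error = str(exc)

replicate = bootstrap_mean([0.0, 1.0], seed=42)  # explicit seed works
```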

Production behavior unchanged. 422/422 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds three words to the seed kwarg docstrings so the rationale for
the run_seed(run_dir.name) recommendation is visible at the call
site: CIs need to be stable across re-computation of the same run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/eva/__init__.py
# Bump metrics_version when changes affect metric computation (metrics code,
# judge prompts, pricing tables, postprocessor).
-metrics_version = "2.0.0"
+metrics_version = "2.1.0"
Collaborator Author

Is a minor bump (2.0.0 -> 2.1.0) correct here, given that this change adds output fields?

…s_for_composite

Both functions differed only in the value extractor; factor the shared
grouping + averaging into a private _scenario_means(per_record_values)
helper. Each public function becomes a one-line wrapper that builds
the dict via comprehension. No new imports (no Callable needed).

Also clean up two docstring touches in bootstrap_ci:
- "95% percentile bootstrap CI" -> "95% bootstrap CI" (the percentile
  method is established at module level; "percentile" alongside "95%"
  reads as redundant).
- "so behavior is defined" -> "so behavior is deterministic" (more
  precise about what the seed gives you).

422/422 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lindsaydbrin lindsaydbrin marked this pull request as ready for review June 1, 2026 18:26