[ab-advisor] Improve experiment infrastructure: schema, reporting & audit #35905

@github-actions

🔬 Improve Experiment Infrastructure: Schema, Reporting & Audit

Triggered by: ab-testing-advisor on 2026-05-30
Parent campaign: #aw_filedieta


Background

The field-presence-checker agent verified the current state of three candidate schema fields (analysis_type, tags, notify). Two of the three — tags and notify — are only partially implemented: they are parsed and rendered in the picker summary table but have no downstream behavioral effect. This issue tracks completing those fields and proposes concrete reporting and audit-trail improvements.

Field-presence-checker findings summary
| Field | Status | Gap |
|---|---|---|
| analysis_type | ✅ fully implemented | Consumed by experiments_analyze_statistics.go to select the statistical test; no action needed. |
| tags | ⚠️ partial | Parsed and displayed in picker output, but no Go code filters, routes, or acts on tags at runtime. |
| notify | ⚠️ partial | Parsed into ExperimentNotify{Discussion, Issue} struct and displayed in picker output, but no code delivers notifications. |

Area 1: Frontmatter Schema — Complete tags and notify

1a. tags — Add Runtime Filtering

Current gap: cfg.Tags is populated but never consumed outside display.

Proposed change in pkg/workflow/compiler_experiments.go and the CLI:

  • Surface tags in the compiled lock-file env block so pick_experiment.cjs can filter experiments by tag when a --tag flag is supplied.
  • In pkg/cli/experiments_analyze_statistics.go, allow --tag <label> to restrict analysis to experiments bearing that tag (useful for bulk analysis of cost-reduction or quality campaigns).
  • In the daily experiment report workflow, use tags to group experiments by theme in the summary table.

# Example frontmatter usage
experiments:
  prompt_style:
    variants: [detailed, concise]
    tags: [cost-reduction, prompt-engineering]
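
The tag filter proposed above could look roughly like the following. This is a minimal sketch, not the project's code: `ExperimentConfig` is a simplified stand-in for the parsed frontmatter config, and `FilterByTag`, `hasTag`, and the `retry_policy` experiment are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// ExperimentConfig is a simplified, hypothetical stand-in for the parsed
// frontmatter config; only the fields needed for tag filtering are shown.
type ExperimentConfig struct {
	Variants []string
	Tags     []string
}

// FilterByTag returns the names of experiments that carry the given tag,
// sorted for stable output. An empty tag matches every experiment, which
// mirrors running the command without --tag.
func FilterByTag(experiments map[string]ExperimentConfig, tag string) []string {
	names := make([]string, 0, len(experiments))
	for name, cfg := range experiments {
		if tag == "" || hasTag(cfg.Tags, tag) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

// hasTag reports whether want appears in tags (exact, case-sensitive match).
func hasTag(tags []string, want string) bool {
	for _, t := range tags {
		if t == want {
			return true
		}
	}
	return false
}

func main() {
	experiments := map[string]ExperimentConfig{
		"prompt_style": {
			Variants: []string{"detailed", "concise"},
			Tags:     []string{"cost-reduction", "prompt-engineering"},
		},
		// Hypothetical second experiment, for contrast only.
		"retry_policy": {
			Variants: []string{"none", "backoff"},
			Tags:     []string{"quality"},
		},
	}
	fmt.Println(FilterByTag(experiments, "cost-reduction"))
}
```

Treating an empty tag as "match everything" keeps the untagged invocation's behavior unchanged, so adding the flag would not affect existing callers.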

1b. notify — Implement Notification Delivery

Current gap: cfg.Notify.Discussion and cfg.Notify.Issue are parsed but no code posts the notification.

Proposed change in pkg/cli/experiments_analyze_statistics.go (or a new experiments_notify.go):

// After significance is detected, deliver the report to each configured target.
if result.PValue < 0.05 {
    report := formatSignificanceReport(result)
    if cfg.Notify.Issue != 0 {
        postIssueComment(cfg.Notify.Issue, report)
    }
    if cfg.Notify.Discussion != 0 {
        postDiscussionComment(cfg.Notify.Discussion, report)
    }
}

The pick_experiment.cjs step summary should also surface notify targets so operators can see at a glance where results will be delivered.

# Example frontmatter usage
experiments:
  prompt_style:
    variants: [detailed, concise]
    notify:
      issue: 1234
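
One way to keep the delivery logic testable is to put the GitHub calls behind an interface. The sketch below makes several assumptions: the `Poster` interface, the `Notify` function, and the `logPoster` stub are hypothetical, and the `int` field types on `ExperimentNotify` are inferred from the `!= 0` checks in the proposal rather than taken from the real struct. It uses `errors.Join`, so it needs Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// ExperimentNotify mirrors the struct named in the issue; the int field
// types are an assumption based on the `!= 0` checks in the proposal.
type ExperimentNotify struct {
	Discussion int
	Issue      int
}

// Poster is a hypothetical abstraction over the GitHub calls so the
// dispatch logic can be exercised without network access.
type Poster interface {
	PostIssueComment(number int, body string) error
	PostDiscussionComment(number int, body string) error
}

// Notify posts the report to every configured target when the result is
// significant. It attempts all targets and returns the joined errors.
func Notify(p Poster, n ExperimentNotify, pValue, alpha float64, report string) error {
	if pValue >= alpha {
		return nil // not significant: nothing to deliver
	}
	var errs []error
	if n.Issue != 0 {
		if err := p.PostIssueComment(n.Issue, report); err != nil {
			errs = append(errs, fmt.Errorf("issue #%d: %w", n.Issue, err))
		}
	}
	if n.Discussion != 0 {
		if err := p.PostDiscussionComment(n.Discussion, report); err != nil {
			errs = append(errs, fmt.Errorf("discussion #%d: %w", n.Discussion, err))
		}
	}
	return errors.Join(errs...)
}

// logPoster is a stand-in Poster that only records what would be sent.
type logPoster struct{ sent []string }

func (l *logPoster) PostIssueComment(number int, body string) error {
	l.sent = append(l.sent, fmt.Sprintf("issue:%d", number))
	return nil
}

func (l *logPoster) PostDiscussionComment(number int, body string) error {
	l.sent = append(l.sent, fmt.Sprintf("discussion:%d", number))
	return nil
}

func main() {
	p := &logPoster{}
	err := Notify(p, ExperimentNotify{Issue: 1234}, 0.031, 0.05, "concise wins on token count")
	fmt.Println(p.sent, err)
}
```

Attempting every target before returning means a failed issue comment does not suppress the discussion post; whether that is the desired behavior is a design choice left open here.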

Area 2: Reporting & Dashboards

Proposed: a daily-experiment-report workflow (or an extension of the existing one) that:

  1. Aggregates run data: downloads the experiments/state.json artifact from each recent workflow run via gh run download, then extracts variant and outcome metrics per run.
  2. Computes running statistics: per variant, n, mean, variance, and p_value (using analysis_type to select the right test).
  3. Detects significance: when p_value < 0.05 AND n >= min_samples for all variants, marks the experiment as concluded and identifies the winner.
  4. Generates a visual table as an ASCII comparison artifact:
┌─────────────────────────────────────────────────┐
│  Experiment: prompt_style (daily-file-diet)     │
│  Runs: detailed=52  concise=51  total=103       │
├─────────────────┬───────────────┬───────────────┤
│ Metric          │ detailed      │ concise       │
├─────────────────┼───────────────┼───────────────┤
│ Completeness    │ 0.91 ± 0.08   │ 0.88 ± 0.11   │
│ Token count     │ 4,820 ± 310   │ 3,940 ± 290 ✅ │
│ Duration (ms)   │ 48,200        │ 41,100        │
├─────────────────┴───────────────┴───────────────┤
│ p-value: 0.031  Winner: concise (token savings) │
└─────────────────────────────────────────────────┘
  5. Posts results: uses cfg.Notify to post the report to the designated discussion or issue, and calls safeoutputs add_comment with the table.

Area 3: Audit & OTEL Integration

3a. OTEL Span Attributes

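In pick_experiment.cjs, after assignment, export the experiment as OTEL resource attributes:

// Prepend the experiment attributes, preserving any existing resource attributes.
// The fallback is applied before concatenation so an unset variable never yields "undefined".
const existing = process.env.OTEL_RESOURCE_ATTRIBUTES || '';
const attrs = `experiment.name=${experimentName},experiment.variant=${assignedVariant}`;
core.exportVariable('OTEL_RESOURCE_ATTRIBUTES', existing ? `${attrs},${existing}` : attrs);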

This surfaces experiment.name and experiment.variant on every span in the run, enabling Grafana/Jaeger dashboards to facet traces by experiment without any post-hoc joining.
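
The merge logic is small but easy to get wrong when the variable is unset or a name contains a delimiter. The sketch below shows the same logic in Go for consistency with the other sketches here, although pick_experiment.cjs itself is JavaScript. `MergeResourceAttributes` is a hypothetical helper, `service.name=gh-aw` is an illustrative value, and percent-encoding `,` and `=` is a defensive assumption about how consumers parse the variable.

```go
package main

import (
	"fmt"
	"strings"
)

// escaper percent-encodes the characters that delimit entries in
// OTEL_RESOURCE_ATTRIBUTES so names containing them stay parseable.
// "%" is escaped too, so already-encoded input is not misread.
var escaper = strings.NewReplacer("%", "%25", ",", "%2C", "=", "%3D")

// MergeResourceAttributes prepends the experiment attributes to any existing
// OTEL_RESOURCE_ATTRIBUTES value, omitting the separator when it is empty.
func MergeResourceAttributes(existing, experiment, variant string) string {
	attrs := "experiment.name=" + escaper.Replace(experiment) +
		",experiment.variant=" + escaper.Replace(variant)
	if existing == "" {
		return attrs
	}
	return attrs + "," + existing
}

func main() {
	fmt.Println(MergeResourceAttributes("", "prompt_style", "concise"))
	fmt.Println(MergeResourceAttributes("service.name=gh-aw", "prompt_style", "concise"))
}
```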

3b. gh aw audit Integration

  • Add experiment_name and variant columns to the audit log emitted by gh aw audit.
  • Enable gh aw audit --experiment prompt_style --variant concise to filter audit entries to only runs of a given variant.
  • This allows direct comparison of failure modes (e.g., noop-without-output errors) across variants without needing to join on run IDs externally.
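
The filter and the per-variant comparison could be sketched as below. The `AuditEntry` shape, `FilterEntries`, and `FailuresByVariant` are hypothetical and exist only to illustrate the proposal; the real audit log schema may differ, and the entries in `main` are made-up data.

```go
package main

import "fmt"

// AuditEntry is a hypothetical audit row extended with the two proposed
// columns; the real `gh aw audit` schema may differ.
type AuditEntry struct {
	RunID          int64
	ExperimentName string
	Variant        string
	FailureMode    string // empty when the run succeeded
}

// FilterEntries keeps entries matching the experiment and, if non-empty,
// the variant, mirroring the proposed --experiment / --variant flags.
func FilterEntries(entries []AuditEntry, experiment, variant string) []AuditEntry {
	var out []AuditEntry
	for _, e := range entries {
		if e.ExperimentName != experiment {
			continue
		}
		if variant != "" && e.Variant != variant {
			continue
		}
		out = append(out, e)
	}
	return out
}

// FailuresByVariant counts failure modes per variant for one experiment.
func FailuresByVariant(entries []AuditEntry, experiment string) map[string]map[string]int {
	counts := map[string]map[string]int{}
	for _, e := range FilterEntries(entries, experiment, "") {
		if e.FailureMode == "" {
			continue // successful run: nothing to count
		}
		if counts[e.Variant] == nil {
			counts[e.Variant] = map[string]int{}
		}
		counts[e.Variant][e.FailureMode]++
	}
	return counts
}

func main() {
	entries := []AuditEntry{ // illustrative data only
		{RunID: 1, ExperimentName: "prompt_style", Variant: "concise", FailureMode: "noop-without-output"},
		{RunID: 2, ExperimentName: "prompt_style", Variant: "detailed"},
		{RunID: 3, ExperimentName: "prompt_style", Variant: "concise"},
	}
	fmt.Println(len(FilterEntries(entries, "prompt_style", "concise")))
	fmt.Println(FailuresByVariant(entries, "prompt_style"))
}
```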

3c. Step Summary Enrichment

In pick_experiment.cjs, append to the GitHub Actions step summary:

### 🧪 Experiment Assignment
| Field | Value |
|---|---|
| Name | `prompt_style` |
| Variant | `concise` |
| Run # | 37 |
| Notify targets | issue #1234 |
| Tags | cost-reduction, prompt-engineering |

This makes experiment metadata immediately visible in the Actions run summary without needing to download artifacts.


Implementation Steps

  • tags: Add --tag filtering to gh aw experiments analyze and lock-file env expansion
  • notify: Implement notification delivery in experiments_analyze_statistics.go (or new experiments_notify.go)
  • Reporting: Extend daily-experiment-report workflow to aggregate artifacts, compute stats, render ASCII table, and post via notify
  • OTEL: Emit experiment.name / experiment.variant resource attributes from pick_experiment.cjs
  • Audit: Add experiment columns to gh aw audit output and --experiment / --variant filter flags
  • Step summary: Enrich pick_experiment.cjs step summary with full experiment metadata table

References

  • A/B Testing in gh-aw
  • pkg/workflow/compiler_experiments.go
  • actions/setup/js/pick_experiment.cjs
  • pkg/cli/experiments_analyze_statistics.go

Generated by 🧪 Daily A/B Testing Advisor · sonnet46 1.7M

  • expires on Jun 13, 2026, 11:02 AM UTC
