[ab-advisor] Experiment campaign for daily-file-diet: A/B test prompt_style #35904

@github-actions

🧪 Experiment Campaign: daily-file-diet

Workflow file: .github/workflows/daily-file-diet.md
Selected dimension: prompt_style
Triggered by: ab-testing-advisor on 2026-05-30


Background

daily-file-diet monitors the Go codebase daily and identifies the largest non-test source file. When that file exceeds 800 lines, it opens a GitHub issue containing Serena MCP semantic analysis, proposed file splits, a test coverage plan, and acceptance criteria. The current prompt is dense: it encodes a full multi-step analysis protocol, a parameterized issue template with progressive-disclosure rules, and explicit MCP usage instructions. This density may be driving token consumption and latency without proportional quality gains; a prompt_style experiment will test whether a leaner prompt achieves equivalent output quality.

Hypothesis

  • H0 (null): Changing the prompt style does not meaningfully change the quality or completeness of the generated refactoring issue.
  • H1 (alternative): A concise prompt style produces issues of equivalent or higher measured quality while reducing token usage by ≥15%; run duration is tracked as a secondary signal.

Experiment Configuration

Add the following experiments: block to the workflow frontmatter:

experiments:
  prompt_style:
    variants: [detailed, concise]
    description: "Tests whether a leaner prompt preserves refactoring-issue quality vs. the current verbose multi-step protocol."
    hypothesis: "H0: no change in issue completeness score. H1: concise variant reduces token usage by ≥15% with no significant drop in issue quality (split suggestions present, acceptance criteria present, Serena analysis present)."
    metric: issue_completeness_score
    secondary_metrics: [effective_token_count, run_duration_ms]
    guardrail_metrics:
      - name: issue_creation_success_rate
        direction: min
        threshold: 0.90
      - name: empty_output_rate
        direction: max
        threshold: 0.05
    min_samples: 50
    weight: [50, 50]
    start_date: "2026-05-30"
    issue: "#aw_filedieta"

Variant descriptions:

  • detailed (baseline): Current prompt — full multi-step analysis protocol with explicit Serena MCP usage instructions, parameterized issue template, progressive-disclosure formatting rules.
  • concise: Compressed prompt retaining the essential intent (find largest file, if ≥800 lines create issue with semantic analysis and split suggestions) while removing template scaffolding and relying on the model's own formatting judgment.

Workflow Changes Required

In .github/workflows/daily-file-diet.md, wrap the prompt body in a conditional block keyed on the prompt_style experiment:

View diff
-You are a Go code quality expert...
-[full multi-step protocol with template]
+{{#if experiments.prompt_style == "concise" }}
+You are a Go code quality expert. Each weekday:
+1. Find the largest non-test `.go` file by line count (use `find` + `wc -l`).
+2. If it is ≥ 800 lines, use Serena MCP semantic analysis to identify function relationships, complexity hotspots, and module boundary candidates.
+3. Create a GitHub issue titled `[file-diet] <filename> (N lines)` with: a summary of findings, 2–4 concrete split proposals with rationale, a test coverage plan, and an acceptance checklist. Use `<details>` blocks for verbose sections.
+4. If all files are < 800 lines, output a brief ✅ healthy message and call `noop`.
+{{else}}
+[existing detailed prompt verbatim]
+{{/if}}
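
Steps 1 and 2 of the concise prompt (find the largest non-test `.go` file, then compare it against the 800-line threshold) can be sketched in Python as below. This is only an illustration of the selection rule, not the workflow's implementation: the function names `largest_non_test_go_file` and `needs_issue` are hypothetical, and the line count includes a final line without a trailing newline, which `wc -l` would not count.

```python
import os
from typing import Optional, Tuple

# Threshold taken from the prompt text: files at or above 800 lines qualify.
LINE_THRESHOLD = 800


def largest_non_test_go_file(root: str) -> Optional[Tuple[str, int]]:
    """Return (path, line_count) for the largest non-test .go file under root.

    Hypothetical helper: returns None when no matching file exists.
    """
    best: Optional[Tuple[str, int]] = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip non-Go files and Go test files.
            if not name.endswith(".go") or name.endswith("_test.go"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                line_count = sum(1 for _ in handle)
            if best is None or line_count > best[1]:
                best = (path, line_count)
    return best


def needs_issue(line_count: int, threshold: int = LINE_THRESHOLD) -> bool:
    """True when the file is at or above the threshold (the >= 800 rule)."""
    return line_count >= threshold
```

A caller would run `largest_non_test_go_file(".")` from the repository root and open an issue only when `needs_issue` returns true for the reported line count.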

After editing, run:

gh aw compile daily-file-diet

Success Metrics

| Metric | Type | Target |
|---|---|---|
| Issue completeness score (split suggestions ✓ + acceptance criteria ✓ + Serena analysis ✓) | Primary | ≥ 0.85 for both variants |
| Effective token count | Secondary | ≥ 15% reduction in concise |
| Run duration (ms) | Secondary | Signal only |
| Issue creation success rate | Guardrail | Must not drop below 90% |
| Empty output rate | Guardrail | Must remain < 5% |
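
The targets above can be checked mechanically once per-variant summaries exist. The sketch below is one possible way to do that, under assumptions: the `VariantSummary` fields and the `evaluate` function are hypothetical names, and the inputs are per-variant means computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class VariantSummary:
    """Hypothetical per-variant aggregate; field names are illustrative."""
    completeness: float        # mean issue completeness score, 0..1
    tokens: float              # mean effective token count per run
    issue_success_rate: float  # fraction of runs that created an issue
    empty_output_rate: float   # fraction of runs with empty output


def evaluate(detailed: VariantSummary, concise: VariantSummary) -> dict:
    """Compare two variants against the targets listed in the table."""
    # Relative token saving of concise versus the detailed baseline.
    token_reduction = 1.0 - concise.tokens / detailed.tokens
    return {
        # Primary: both variants must reach a completeness of 0.85.
        "primary_ok": detailed.completeness >= 0.85
        and concise.completeness >= 0.85,
        # Secondary: concise should use at least 15% fewer tokens.
        "token_target_ok": token_reduction >= 0.15,
        # Guardrails: success rate >= 90% and empty output < 5% for both.
        "guardrails_ok": all(
            v.issue_success_rate >= 0.90 and v.empty_output_rate < 0.05
            for v in (detailed, concise)
        ),
        "token_reduction": token_reduction,
    }
```

Run duration is left out of the checks because the table treats it as a signal only.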

Statistical Design

  • Variants: detailed (baseline), concise
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 50 (to detect a 15-percentage-point difference in completeness score at 80% power, two-proportion z-test)
  • Expected experiment duration: ~100 weekday runs ≈ 20 weeks (workflow runs Mon–Fri; ~5 runs/week)
  • Analysis approach: Two-proportion z-test for completeness score; two-sample t-test / Mann-Whitney U for token count and duration
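
The sample-size reasoning and the two-proportion z-test can be sketched with the Python standard library alone. This assumes a two-sided test at α = 0.05, the usual unpooled normal approximation for the sample size, and a binary per-run outcome (for example, "issue is complete"); the baseline rates passed in are assumptions, since the issue does not state one, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_variant(p1: float, p2: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate runs per variant needed to detect p1 vs p2 (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Return (z, two-sided p-value) for successes x1/n1 versus x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both samples are all successes or all failures: no evidence
        # of a difference under this test.
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

Under this approximation, `n_per_variant(0.85, 0.70)` gives 118 and `n_per_variant(0.95, 0.80)` gives 73, so the number of runs needed to detect a 15-percentage-point difference is sensitive to the assumed baseline rate and is worth confirming before the experiment starts.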

Implementation Steps

  • Add experiments: section to frontmatter
  • Add conditional blocks to workflow prompt body using {{#if experiments.prompt_style == "concise" }} (value-comparison form — never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
  • Run gh aw compile daily-file-diet to regenerate lock file
  • Monitor the per-run experiment state artifact (/tmp/gh-aw/agent/experiments/state.json)
  • After ≥ 50 runs per variant, analyze variant distribution via workflow run artifacts
  • Score issues manually (or via a scoring workflow) for completeness: split proposals present, acceptance criteria present, Serena analysis present
  • Document findings and promote winning variant
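
For the scoring step, a rough automated first pass could look like the sketch below. It is an assumption-laden heuristic, not the campaign's scoring method: the keyword patterns in `CHECKS` are hypothetical stand-ins for the three components (split proposals, acceptance criteria, Serena analysis), and the score is simply the fraction of components detected in the issue body.

```python
import re

# Hypothetical keyword patterns, one per completeness component.
CHECKS = {
    "split_proposals": re.compile(r"\bsplit\b", re.IGNORECASE),
    "acceptance_criteria": re.compile(
        r"acceptance (criteria|checklist)", re.IGNORECASE
    ),
    "serena_analysis": re.compile(r"\bserena\b", re.IGNORECASE),
}


def component_hits(issue_body: str) -> dict:
    """Report which of the three components appear in the issue body."""
    return {
        name: bool(pattern.search(issue_body))
        for name, pattern in CHECKS.items()
    }


def completeness_score(issue_body: str) -> float:
    """Fraction of components present: 0, 1/3, 2/3, or 1."""
    hits = component_hits(issue_body)
    return sum(hits.values()) / len(hits)
```

Because the per-issue score takes only four values, a binary outcome such as "all three components present" may be a better fit for the two-proportion z-test; keyword matches should also be spot-checked by hand, since a match does not guarantee the section is substantive.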

Generated by 🧪 Daily A/B Testing Advisor · sonnet46 1.7M

  • expires on Jun 13, 2026, 11:02 AM UTC
