[ab-advisor] Experiment campaign for daily-file-diet: A/B test prompt_style #35904

### 🧪 Experiment Campaign: daily-file-diet

**Workflow file**: `.github/workflows/daily-file-diet.md`
**Selected dimension**: `prompt_style`
**Triggered by**: `ab-testing-advisor` on 2026-05-30

---

### Background

`daily-file-diet` scans the Go codebase each weekday and identifies the largest non-test source file. When that file is 800 lines or longer, the workflow opens a GitHub issue containing Serena MCP semantic analysis, proposed file splits, a test coverage plan, and acceptance criteria. The prompt is currently **dense**: it encodes a full multi-step analysis protocol, a parameterized issue template with progressive-disclosure rules, and explicit MCP usage instructions. This density may be driving token consumption and latency without proportional quality gains. A `prompt_style` experiment will determine whether a leaner prompt achieves equivalent output quality.
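For readers unfamiliar with the selection step, the following is a minimal Python sketch of "find the largest non-test Go file and compare it to the threshold". It is an illustration only, not the workflow's actual implementation (the concise prompt later in this issue suggests `find` + `wc -l`); the function names `largest_non_test_go_file` and `needs_issue` are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple

THRESHOLD_LINES = 800  # threshold taken from the workflow description


def largest_non_test_go_file(root: str) -> Optional[Tuple[Path, int]]:
    """Return (path, line_count) for the largest non-test .go file under root.

    Returns None when no such file exists. Lines are counted by iterating the
    file, so a final line without a trailing newline is still counted
    (this differs slightly from `wc -l`).
    """
    best: Optional[Tuple[Path, int]] = None
    for path in Path(root).rglob("*.go"):
        # Skip Go test files and anything that is not a regular file.
        if path.name.endswith("_test.go") or not path.is_file():
            continue
        with path.open("rb") as fh:
            lines = sum(1 for _ in fh)
        if best is None or lines > best[1]:
            best = (path, lines)
    return best


def needs_issue(line_count: int, threshold: int = THRESHOLD_LINES) -> bool:
    """True when the file is at or above the threshold (the >= 800 rule)."""
    return line_count >= threshold
```

A caller would pass the repository root, for example `largest_non_test_go_file(".")`, and open an issue only when `needs_issue` returns true for the reported line count.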

### Hypothesis

- **H0 (null)**: Changing the prompt style does not meaningfully change the quality or completeness of the generated refactoring issue.
- **H1 (alternative)**: A `concise` prompt style produces issues of equivalent or higher measured quality while reducing token usage by ≥15%. Run duration is tracked as a signal only.

### Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

```yaml
experiments:
  prompt_style:
    variants: [detailed, concise]
    description: "Tests whether a leaner prompt preserves refactoring-issue quality vs. the current verbose multi-step protocol."
    hypothesis: "H0: no change in issue completeness score. H1: concise variant reduces token usage by ≥15% with no significant drop in issue quality (split suggestions present, acceptance criteria present, Serena analysis present)."
    metric: issue_completeness_score
    secondary_metrics: [effective_token_count, run_duration_ms]
    guardrail_metrics:
      - name: issue_creation_success_rate
        direction: min
        threshold: 0.90
      - name: empty_output_rate
        direction: max
        threshold: 0.05
    min_samples: 50
    weight: [50, 50]
    start_date: "2026-05-30"
    issue: "#aw_filedieta"  # quoted: an unquoted value starting with '#' is parsed by YAML as a comment
```

**Variant descriptions**:
- `detailed` *(baseline)*: Current prompt — full multi-step analysis protocol with explicit Serena MCP usage instructions, parameterized issue template, progressive-disclosure formatting rules.
- `concise`: Compressed prompt retaining the essential intent (find largest file, if ≥800 lines create issue with semantic analysis and split suggestions) while removing template scaffolding and relying on the model's own formatting judgment.

### Workflow Changes Required

In `.github/workflows/daily-file-diet.md`, wrap the prompt body in a conditional block keyed on the `prompt_style` experiment:

<details><summary>View diff</summary>

```diff
-You are a Go code quality expert...
-[full multi-step protocol with template]
+{{#if experiments.prompt_style == "concise" }}
+You are a Go code quality expert. Each weekday:
+1. Find the largest non-test `.go` file by line count (use `find` + `wc -l`).
+2. If it is ≥ 800 lines, use Serena MCP semantic analysis to identify function relationships, complexity hotspots, and module boundary candidates.
+3. Create a GitHub issue titled `[file-diet] <filename> (N lines)` with: a summary of findings, 2–4 concrete split proposals with rationale, a test coverage plan, and an acceptance checklist. Use `<details>` blocks for verbose sections.
+4. If all files are < 800 lines, output a brief ✅ healthy message and call `noop`.
+{{else}}
+[existing detailed prompt verbatim]
+{{/if}}
```

</details>

After editing, run:
```bash
gh aw compile daily-file-diet
```

### Success Metrics

| Metric | Type | Target |
|--------|------|--------|
| Issue completeness score (split suggestions ✓ + acceptance criteria ✓ + Serena analysis ✓) | Primary | ≥ 0.85 for both variants |
| Effective token count | Secondary | ≥ 15% reduction in `concise` |
| Run duration (ms) | Secondary | Signal only |
| Issue creation success rate | Guardrail | Must not drop below 90% |
| Empty output rate | Guardrail | Must remain < 5% |
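The guardrail rows above can be checked mechanically once per-variant rates are available. The sketch below assumes that `direction: min` means the observed value must stay at or above the threshold and that `direction: max` means it must stay strictly below it (matching the "< 5%" wording in the table). That interpretation, the `Guardrail` class, and `violated_guardrails` are illustrative assumptions, not the `gh-aw` runtime's documented behavior.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Guardrail:
    name: str
    direction: str  # "min" or "max" (semantics assumed, see violated_guardrails)
    threshold: float


# Values copied from the experiment configuration in this issue.
GUARDRAILS = [
    Guardrail("issue_creation_success_rate", "min", 0.90),
    Guardrail("empty_output_rate", "max", 0.05),
]


def violated_guardrails(observed: Dict[str, float]) -> List[str]:
    """Return the names of guardrails that the observed rates violate.

    A missing metric is reported as a violation so that gaps in the data
    are not silently treated as passing.
    """
    violations = []
    for rail in GUARDRAILS:
        value = observed.get(rail.name)
        if value is None:
            violations.append(rail.name)
        elif rail.direction == "min" and value < rail.threshold:
            violations.append(rail.name)
        elif rail.direction == "max" and value >= rail.threshold:
            violations.append(rail.name)
    return violations
```

Running this separately for each variant keeps a guardrail breach in `concise` from being masked by healthy numbers in `detailed`.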

### Statistical Design

- **Variants**: `detailed` (baseline), `concise`
- **Assignment**: Round-robin via `gh-aw` experiments runtime (cache-based)
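- **Minimum runs per variant**: 50, intended to detect a 15-percentage-point difference in completeness score at 80% power with a two-proportion z-test. Under the usual normal approximation (two-sided α = 0.05) this holds only when baseline completeness is close to 1.0. At the 0.85 target baseline, detecting a 15-point drop needs roughly 120 runs per variant, and 50 runs reliably detects only differences of about 24 points or more.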
- **Expected experiment duration**: ~100 weekday runs ≈ 20 weeks (workflow runs Mon–Fri; ~5 runs/week)
- **Analysis approach**: Two-proportion z-test for completeness score; two-sample t-test / Mann-Whitney U for token count and duration
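The sample-size reasoning and the primary test can be sketched with the Python standard library. This sketch assumes completeness is scored per issue as pass/fail (so it can be treated as a proportion), a two-sided α of 0.05, and the unpooled normal approximation for sample size. None of these choices is stated in the plan itself, and the function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def runs_per_variant(p_baseline: float, p_variant: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate runs needed per variant for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    effect = p_baseline - p_variant
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) using the pooled standard error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both variants are all-pass or all-fail: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, `runs_per_variant(0.85, 0.70)` estimates the runs needed to detect a drop from 0.85 to 0.70, and `two_proportion_z_test(45, 50, 35, 50)` compares 45/50 complete issues against 35/50. For token count and duration, a rank-based test such as Mann-Whitney U is the safer choice if the distributions are skewed.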

### Implementation Steps

- [ ] Add `experiments:` section to frontmatter
- [ ] Add conditional blocks to workflow prompt body using `{{#if experiments.prompt_style == "concise" }}` (value-comparison form — never use the internal `__GH_AW_EXPERIMENTS__` env-var syntax)
- [ ] Run `gh aw compile daily-file-diet` to regenerate lock file
- [ ] Monitor the per-run experiment state artifact (`/tmp/gh-aw/agent/experiments/state.json`)
- [ ] After ≥ 50 runs per variant, analyze variant distribution via workflow run artifacts
- [ ] Score issues manually (or via a scoring workflow) for completeness: split proposals present, acceptance criteria present, Serena analysis present
- [ ] Document findings and promote winning variant
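If scoring is automated, a first pass could be a keyword heuristic over the issue body, as sketched below. The regular expressions are illustrative guesses at how the three required elements might appear in generated issues; they are assumptions, not a validated rubric, and should be spot-checked against manual scoring before being trusted.

```python
import re

# Hypothetical patterns for the three completeness checks named in this plan.
CHECKS = {
    "split_proposals": re.compile(r"\bsplit", re.IGNORECASE),
    "acceptance_criteria": re.compile(r"acceptance\s+(criteria|checklist)", re.IGNORECASE),
    "serena_analysis": re.compile(r"\bserena\b", re.IGNORECASE),
}


def completeness_score(issue_body: str) -> float:
    """Return the fraction of completeness checks found in the issue body (0.0 to 1.0)."""
    hits = sum(1 for pattern in CHECKS.values() if pattern.search(issue_body))
    return hits / len(CHECKS)


def is_complete(issue_body: str) -> bool:
    """Pass/fail view of the same checks, usable as a proportion in a z-test."""
    return completeness_score(issue_body) == 1.0
```

Averaging `completeness_score` over a variant's issues gives a value comparable to the 0.85 target, while `is_complete` yields the binary outcome that a two-proportion test expects.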

### References

- [A/B Testing in gh-aw](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)
- Workflow file: `.github/workflows/daily-file-diet.md`







> Generated by [🧪 Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/26682009164) · sonnet46 1.7M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires on Jun 13, 2026, 11:02 AM UTC
