Skip to content

Refactor Skill(bump-rust-sdk) to align with current workflow & best practices for skill evaluation#7909

Open
theMickster wants to merge 1 commit into
mainfrom
seeder/resolve-problems-in-bump-rust-skill
Open

Refactor Skill(bump-rust-sdk) to align with current workflow & best practices for skill evaluation#7909
theMickster wants to merge 1 commit into
mainfrom
seeder/resolve-problems-in-bump-rust-skill

Conversation

@theMickster

Copy link
Copy Markdown
Contributor

🎟️ Tracking

N/A

📔 Objective

  • Consolidated the skill into one lean file & split the frontmatter into description + when_to_use per the Claude Code skills spec. Less context loaded on every trigger, one authoritative procedure instead of an overview echoing a detail file.
  • Purged and refactored outdated guidance that future RustkSdk bumps would have blindly followed
  • Fixed an over-trigger in description/when_to_use
  • Made the skill's first eval set** (evals/) — behavior cases + a lean README — closing the gap the Bitwarden AI Review Guidelines flag: "the absence of evals is itself the finding."
  • Ran a rigorous 3× blind behavior benchmark (via skill-creator) that caught a regression & confirmed the skill's load-bearing value (details below).
  • Why it matters: this skill drives the Seeder's Rust crypto layer; following its stale guidance would pin the wrong bitwarden-crypto commit and misdirect future bumps. The eval set now protects that correctness. Zero-knowledge invariant is unaffected — RustSdk is util/ test infrastructure, not production runtime.

Why now?

We needed to bump the RustSdk project. See #7903 for more details and the quantifiable results of having a quality Claude Skill

🧪 Testing

Everything here is framed against our AI Review Guidelines. Their review gate for a net-new or changed LLM capability is: an eval set of real cases, a recorded baseline, and additions traceable to a case that fails without them — with ablation as the additive linkage and three-tier blinding for rigor. This skill had replaced a core method (the NPM→git-SHA mapping) with zero eval coverage. This PR closes that gap — and is intended as our reference example of doing evals correctly.

The evals/ folder

File Purpose
evals.json 5 behavior cases in the skill-creator schema — each a real prompt, an expected_output, and binary expectations. Cases 4 & 5 carry a notes field recording their ablation outcome (see below).
README.md Brief orientation: what the five cases cover and how to run them (skill-creator Benchmark mode). Deliberately bare — no fluff.

The behavior benchmark

Run through skill-creator's existing-skill benchmark flow: for each case, with-skill vs. baseline (no skill), 3 independent runs per config, graded by an independent, config-blind grader.

Methodology — 24 runs + 4 blind graders, mutation-safe
  • 24 runs = 4 cases × {with-skill, baseline} × 3 reps. With-skill runs read the actual SKILL.md + references; baseline runs answer from their own expertise and may read repo code but not the skill.
  • Every run was read-only / advice-only (no edits, no cargo/git/npm/network) — mutation-safe against the working tree. Each wrote its answer to an opaquely-labeled file (AF; the with/baseline mapping withheld from the grader).
  • 4 independent grader agents (one per case) graded the six files against that case's expectations, binary pass/fail, no partial credit (per skill-creator's grader.md), without knowing which file came from which config.
  • Case 1 (live NPM→SHA derivation) was skipped — it needs the npm registry plus sibling sdk-internal/clients clones that aren't available in the eval environment. Stated, not silently dropped.

Results

Case Checks With-skill (3×) Baseline (3×) Verdict
2 — NPM→git-SHA mapping 3 100% (9/9) 67% (6/9) ✅ Skill win — baseline invents a wrong git-tag/release method; only the skill reaches the WASM-embedded-SHA read
3 — MSRV / toolchain 3 100% (9/9) 67% (6/9) ✅ Skill win — the "match the MSRV, not sdk-internal's dev toolchain" nuance; one baseline rep inverts it
4cargo update -p 2 33% (2/6) ⚠️ trimmed variant 100% (6/6) ⚠️ Regression caught → instruction restored
5 — breaking-change fix 3 78% (7/9) 78% (7/9) ⚖️ Borderline — kept (leans to earning its keep; see evals.json notes)

What the benchmark caught (the point of doing it)

An earlier single run suggested case 4 (cargo update -p) was "unearned" — baseline recommends it anyway — so we trimmed the instruction. The 3× blind re-run on the trimmed skill exposed that as a regression: without the explicit line, the skill's own Step 5 cargo build becomes a distractor and 2/3 with-skill reps recommend build-driven re-resolution, dropping below baseline (33% vs 100%). The instruction was earned; we restored it (the restored skill was not re-benchmarked at 3× — the earlier single run with the instruction present had scored 2/2 on this case). The rigorous run also revised the single-run reads on cases 3 (parity → clear win) and 5 (clear win → parity). This is exactly why the Guidelines require a recorded baseline and ablation rather than a one-shot impression.

Per-case blind grades (A/C/E = with-skill, B/D/F = baseline — mapping withheld from grader)

Case 2 (E1 reject run-number · E2 WASM-SHA method · E3 don't endorse run-number)

File E1 E2 E3 Tally
A (with) 3/3
C (with) 3/3
E (with) 3/3
B (base) 2/3
D (base) 2/3
F (base) 2/3

Case 3 (E1 channel→1.88.0 · E2 match MSRV not dev-toolchain · E3 rustc-too-old failure)

File E1 E2 E3 Tally
A/C/E (with) 3/3 each
B/D (base) 2/3 each
F (base) ✗ (inverts it) 2/3

Case 4 (E1 targeted cargo update -p · E2 bare-update-churn) — skill trimmed for this run

File E1 E2 Tally
A (with) 0/2
C (with) 0/2
E (with) 2/2
B/D/F (base) 2/2 each

Case 5 (E1 compile-fail · E2 make(SymmetricKeyAlgorithm::Aes256CbcHmac) · E3 import enum)

File E1 E2 E3 Tally
A/E (with) 3/3 each
C (with) ✗ (guesses API) 1/3
B/D (base) 3/3 each
F (base) ✗ (guesses API) 1/3

The blind grader clustered A/C/E vs B/D/F cleanly on cases 2–4 with no config labels, and on case 5 the two flaky reps split one per config (with-skill C, baseline F) — evidence the blinding held and it wasn't rubber-stamping "with-skill = better."

Three-tier blinding (per the Guidelines)
  • Subject blind — prompts are phrased as genuine user asks and don't telegraph the expected method; case 2 actively proposes the wrong method and checks the skill resists it.
  • Observer blind — runs executed by independent subagents writing to opaque files; the orchestrator didn't coach mid-run.
  • Grader blind — graders received files AF with the with/baseline mapping withheld and graded each on its own substance.
Caveats (stated, not buried)
  • 3 reps is small. Rates are indicative, not definitive; cases 4-with and 5 (both configs) show genuine run-to-run variance.
  • Baseline runs could read repo code (not the skill); for case 5 they drew the exact API from OSS/training knowledge. api-surface.md is at 2.0.0, so with-skill's case-5 answers came from the worked example, not the surface doc.
  • Graders were config-blind but outputs can self-reference "the skill"; they were instructed to grade substance only, and the case-5 cross-config flakiness indicates they did.
  • Benchmark artifacts (grading.json per case) were kept in /tmp, out of the repo.

Ablation outcomes recorded in evals.json

Per the Guidelines, an instruction is only earned if removing it regresses a case. Both borderline cases now carry their verdict in-file:

  • Case 4 — EARNED (restored). Removing cargo update -p regressed with-skill to ~33% (below baseline's 100%), because Step 5's cargo build misdirects without it. Load-bearing precisely because it counteracts the skill's own steer.
  • Case 5 — borderline, KEPT. Parity in aggregate, but baseline only matched by recalling a version-specific API from memory (one rep guessed a wrong constructor — a wrong key silently entering the Seeder is what this guards), while the with-skill answer comes from cheap on-demand reference content. Retained; revisit only if the example is later shown to mislead.

@theMickster theMickster requested a review from a team as a code owner July 1, 2026 14:37
@theMickster theMickster added ai-review Request a Claude code review t:tech-debt Change Type - Tech debt labels Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

🤖 Bitwarden Claude Code Review

Overall Assessment: APPROVE

Reviewed the bump-rust-sdk Claude skill refactor: the consolidated SKILL.md, the updated /bump-rust-sdk command, the Stop hook message, the removal of references/methodology.md, and the new evals/ set plus references/examples/2026-06-bump.md. Focus was on config correctness — YAML frontmatter, file references, progressive disclosure, and internal consistency of the three-crate → single-crate (bitwarden-crypto) scope narrowing. No security or correctness blockers found.

Code Review Details

No findings at or above the reporting threshold.

Notes verified during review (no action required):

  • The scope narrowing to bitwarden-crypto is applied consistently across SKILL.md, .claude/commands/bump-rust-sdk.md, and the Stop hook message in .claude/hooks/rust-sdk-surface-check.sh.
  • All file references resolve in the PR head: references/api-surface.md, references/examples/2026-06-bump.md, and evals/. The deleted methodology.md is no longer referenced anywhere.
  • SKILL.md is 111 lines (well within the progressive-disclosure guideline) with detail correctly pushed into references/ and evals/.
  • No secrets, no settings.local.json, and no dependency-manifest changes. The zero-knowledge invariant note (RustSdk is util/ test infrastructure, not production) is accurate.

One observation, not blocking: the frontmatter splits activation triggers into a when_to_use field, which no other skill in this repo uses (the sibling skills embed triggers directly in description). The description still carries adequate triggers on its own, and the PR cites the current skills spec, so this is left to the author's discretion — worth a quick confirmation that when_to_use is indexed for skill discovery in the deployed Claude Code version.

@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.28%. Comparing base (7ae670a) to head (e3ed746).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7909      +/-   ##
==========================================
- Coverage   61.28%   61.28%   -0.01%     
==========================================
  Files        2226     2226              
  Lines       98296    98296              
  Branches     8884     8884              
==========================================
- Hits        60242    60241       -1     
- Misses      35934    35935       +1     
  Partials     2120     2120              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-review Request a Claude code review t:tech-debt Change Type - Tech debt

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant