Refactor Skill(bump-rust-sdk) to align with current workflow & best practices for skill evaluation#7909
Refactor Skill(bump-rust-sdk) to align with current workflow & best practices for skill evaluation#7909theMickster wants to merge 1 commit into
Conversation
…y best practices for skill evaluation
🤖 Bitwarden Claude Code ReviewOverall Assessment: APPROVE Reviewed the Code Review DetailsNo findings at or above the reporting threshold. Notes verified during review (no action required):
One observation, not blocking: the frontmatter splits activation triggers into a |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #7909 +/- ##
==========================================
- Coverage 61.28% 61.28% -0.01%
==========================================
Files 2226 2226
Lines 98296 98296
Branches 8884 8884
==========================================
- Hits 60242 60241 -1
- Misses 35934 35935 +1
Partials 2120 2120 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
🎟️ Tracking
N/A
📔 Objective
description+when_to_useper the Claude Code skills spec. Less context loaded on every trigger, one authoritative procedure instead of an overview echoing a detail file.description/when_to_useevals/) — behavior cases + a lean README — closing the gap the Bitwarden AI Review Guidelines flag: "the absence of evals is itself the finding."skill-creator) that caught a regression & confirmed the skill's load-bearing value (details below).bitwarden-cryptocommit and misdirect future bumps. The eval set now protects that correctness. Zero-knowledge invariant is unaffected — RustSdk isutil/test infrastructure, not production runtime.Why now?
We needed to bump the RustSdk project. See #7903 for more details and the quantifiable results of having a quality Claude Skill
🧪 Testing
Everything here is framed against our AI Review Guidelines. Their review gate for a net-new or changed LLM capability is: an eval set of real cases, a recorded baseline, and additions traceable to a case that fails without them — with ablation as the additive linkage and three-tier blinding for rigor. This skill had replaced a core method (the NPM→git-SHA mapping) with zero eval coverage. This PR closes that gap — and is intended as our reference example of doing evals correctly.
The
evals/folderevals.jsonskill-creatorschema — each a realprompt, anexpected_output, and binaryexpectations. Cases 4 & 5 carry anotesfield recording their ablation outcome (see below).README.mdskill-creatorBenchmark mode). Deliberately bare — no fluff.The behavior benchmark
Run through
skill-creator's existing-skill benchmark flow: for each case, with-skill vs. baseline (no skill), 3 independent runs per config, graded by an independent, config-blind grader.Methodology — 24 runs + 4 blind graders, mutation-safe
SKILL.md+ references; baseline runs answer from their own expertise and may read repo code but not the skill.cargo/git/npm/network) — mutation-safe against the working tree. Each wrote its answer to an opaquely-labeled file (A–F; the with/baseline mapping withheld from the grader).expectations, binary pass/fail, no partial credit (perskill-creator'sgrader.md), without knowing which file came from which config.sdk-internal/clientsclones that aren't available in the eval environment. Stated, not silently dropped.Results
cargo update -pevals.jsonnotes)What the benchmark caught (the point of doing it)
An earlier single run suggested case 4 (
cargo update -p) was "unearned" — baseline recommends it anyway — so we trimmed the instruction. The 3× blind re-run on the trimmed skill exposed that as a regression: without the explicit line, the skill's own Step 5cargo buildbecomes a distractor and 2/3 with-skill reps recommend build-driven re-resolution, dropping below baseline (33% vs 100%). The instruction was earned; we restored it (the restored skill was not re-benchmarked at 3× — the earlier single run with the instruction present had scored 2/2 on this case). The rigorous run also revised the single-run reads on cases 3 (parity → clear win) and 5 (clear win → parity). This is exactly why the Guidelines require a recorded baseline and ablation rather than a one-shot impression.Per-case blind grades (A/C/E = with-skill, B/D/F = baseline — mapping withheld from grader)
Case 2 (E1 reject run-number · E2 WASM-SHA method · E3 don't endorse run-number)
Case 3 (E1 channel→1.88.0 · E2 match MSRV not dev-toolchain · E3 rustc-too-old failure)
Case 4 (E1 targeted
cargo update -p· E2 bare-update-churn) — skill trimmed for this runCase 5 (E1 compile-fail · E2
make(SymmetricKeyAlgorithm::Aes256CbcHmac)· E3 import enum)The blind grader clustered A/C/E vs B/D/F cleanly on cases 2–4 with no config labels, and on case 5 the two flaky reps split one per config (with-skill
C, baselineF) — evidence the blinding held and it wasn't rubber-stamping "with-skill = better."Three-tier blinding (per the Guidelines)
A–Fwith the with/baseline mapping withheld and graded each on its own substance.Caveats (stated, not buried)
api-surface.mdis at2.0.0, so with-skill's case-5 answers came from the worked example, not the surface doc.grading.jsonper case) were kept in/tmp, out of the repo.Ablation outcomes recorded in
evals.jsonPer the Guidelines, an instruction is only earned if removing it regresses a case. Both borderline cases now carry their verdict in-file:
cargo update -pregressed with-skill to ~33% (below baseline's 100%), because Step 5'scargo buildmisdirects without it. Load-bearing precisely because it counteracts the skill's own steer.