diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 0298983..af38f8a 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -4,14 +4,3 @@ When reviewing a pull request, load and apply: https://github.com/AbsaOSS/agentic-toolkit/blob/master/skills/pr-review/SKILL.md - -### Skill-specific checks — apply when any `SKILL.md` is modified - -For every modified `SKILL.md`, also verify: -- `name` is kebab-case, matches the directory name, and is ≤ 64 chars -- `description` covers both "what it does" AND "when to trigger" with explicit trigger keywords -- `description` is ≤ 1024 chars and not padded with filler -- SKILL.md body is < 500 lines, or uses progressive disclosure via `references/` -- No hardcoded credentials, secrets, or absolute internal paths in skill body or scripts -- Any bundled script in `scripts/` is referenced from SKILL.md with clear usage guidance -- The new or modified skill's description does not conflict with or shadow existing skills diff --git a/.github/workflows/check_pr_release_notes.yml b/.github/workflows/check_pr_release_notes.yml index af3761e..646e098 100644 --- a/.github/workflows/check_pr_release_notes.yml +++ b/.github/workflows/check_pr_release_notes.yml @@ -21,3 +21,4 @@ jobs: github-repository: ${{ github.repository }} pr-number: ${{ github.event.number }} skip-labels: "no RN" + title: "## [Rr]elease [Nn]otes" diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7a213fc..66e1e74 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -227,5 +227,6 @@ Before opening a pull request, verify: - [ ] No hardcoded credentials, secrets, or internal paths in skill body or scripts - [ ] Any script in `scripts/` is referenced from `SKILL.md` with usage guidance - [ ] New skill's description does not conflict with or shadow existing skills +- [ ] Skill added to the catalog table in `README.md` - [ ] Evals exist (or a note explains why they are not applicable) - [ ] `skills-ref validate ./skills/my-skill` passes (install: `pip install skills-ref`) diff --git a/README.md b/README.md index f90df00..96210d5 100644 --- a/README.md +++ b/README.md @@ -75,7 +75,10 @@ For the full guide — what skills are, how they activate, project-scoped instal Browse all available skills in the **[skills/](./skills/)** directory — each skill folder contains a `SKILL.md` with its purpose, trigger phrases, and full instructions. -> The catalog table will be populated as skills are added. See `skills/` for the current set. +| Skill | Description | +|------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| +| **[pr-review](./skills/pr-review/)** | Pull request code review — reviews diffs for risk, security issues, API contract changes, dependency bumps, CI/CD and infrastructure changes. Produces concise Blocker / Important / Nit comments. | +| **[token-saving](./skills/token-saving/)** | Always-active response discipline — enforces brevity, no filler openers or closers, structured output, and a What/Why/How footer on code responses. Suspends on explicit "full detail" requests. | ## Finding More Skills @@ -87,7 +90,7 @@ Before building a new skill, check whether one already exists: | [skills.sh](https://skills.sh) | Open registry — install with `npx skills add ` | | [anthropics/skills](https://github.com/anthropics/skills) | Anthropic reference skills including `skill-creator` | | [absa-group/agent-skills](https://github.com/absa-group/agent-skills) | Broader ABSA-owned skill collection | -| [absa-group/cps-agentic-toolkit](https://github.com/absa-group/cps-agentic-toolkit) | CPS team's skill set built on top of this repo | +| [absa-group/cps-agentic-toolkit](https://github.com/absa-group/cps-agentic-toolkit) | CPS team's extended skill set (ABSA-internal) | ## Contributing @@ -116,3 +119,4 @@ Claude, Cursor, Windsurf, and custom pipelines. ## Troubleshooting Setup issues and common fixes are covered in **[docs/troubleshooting.md](./docs/troubleshooting.md)**. +All documentation guides are indexed at **[docs/](./docs/)**. diff --git a/docs/README.md b/docs/README.md index a13b50d..351b430 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,13 +2,22 @@ Navigation hub for all guides in this repository. Browse by category below. -| Guide | Audience | Description | -|-----------------------------------------|----------|------------------------------------------------------------------------------------| -| [Getting Started](./getting-started.md) | Users | What skills are, how to install them, Copilot CLI usage | -| [Troubleshooting](./troubleshooting.md) | Users | Setup guides and fixes for install, activation, and proxy issues | -| [Contributing](../CONTRIBUTING.md) | Authors | Skill folder layout, frontmatter, description writing, body guidelines, PR process | -| [Skill Testing](./skill-testing.md) | Authors | Eval creation, fixtures, regression loops, trigger and description optimization | - -> **Keep this index up to date.** When you add a new guide under `docs/`, add a row to the table above. +## Setup & Repository Guides + +| Guide | Description | +|-----------------------------------------|-------------------------------------------------------------------------------------| +| [Getting Started](./getting-started.md) | What skills are, how to install them, Copilot CLI usage | +| [Contributing](../CONTRIBUTING.md) | Skill folder layout, frontmatter, description writing, body guidelines, PR process | +| [Skill Testing](./skill-testing.md) | Eval creation, fixtures, regression loops, trigger and description optimization | +| [Troubleshooting](./troubleshooting.md) | Setup fixes for install, activation, and proxy issues | + +## Skill Guides + +| Guide | Description | +|-------------------------------------|------------------------------------------------------------------------------------| +| [PR Review](./pr-review.md) | How the PR review skill works, what sections it applies, and how to trigger it | +| [Token Saving](./token-saving.md) | Keeping AI responses concise — how the token-saving skill works and when it applies | + +> **Keep this index up to date.** When you add a new guide, add a row to the appropriate table above. See also the [main README](../README.md) for the skill catalog, scope, and FAQ. diff --git a/docs/getting-started.md b/docs/getting-started.md index d5402a3..2c9b085 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -132,7 +132,7 @@ Project skills take precedence over global skills when both exist. ### Add project-specific skills For skills that only apply to a specific repository, place them in `.github/skills/` within that repo. These are loaded -automatically when Copilot CLI is launched from that directory, layered on top of your personal and CPS base skills. +automatically when Copilot CLI is launched from that directory, layered on top of your personal and shared base skills. ``` your-project-repo/ diff --git a/docs/pr-review.md b/docs/pr-review.md new file mode 100644 index 0000000..bdb57e0 --- /dev/null +++ b/docs/pr-review.md @@ -0,0 +1,101 @@ +# PR Review Skill + +The `pr-review` skill performs structured, risk-aware code review on pull requests and diffs. It activates automatically when you ask for PR feedback, share a diff, or ask for a sanity check on changes. + +--- + +## What it does + +The skill reads the list of files changed in a PR, selects the relevant review sections, and produces a concise review grouped by severity: + +| Severity | Meaning | +|----------|---------| +| **Blocker** | Must be fixed before merge — correctness bug, security vulnerability, broken contract, credentials in code, data loss risk | +| **Important** | Should be fixed soon — increases risk, missing tests for changed logic, non-default that's hard to change later | +| **Nit** | Minor polish — style, naming, optional improvement, non-urgent edge case | + +If a section has no findings, the skill writes `_None._` to show it was checked, not skipped. + +--- + +## Sections + +The skill always applies **Standard review** and selects additional sections based on the files the PR touches: + +| Section | Triggers when PR touches | +|---|---| +| Standard | Always applied | +| API contracts | Endpoints, DTOs, config keys, env vars, exit codes, output strings | +| Dependency bumps | `pom.xml`, `requirements.txt`, `pyproject.toml`, `package.json`, `*.csproj`, `go.mod`, `build.sbt` | +| CI/CD | `.github/workflows/`, `Jenkinsfile`, `Justfile`, `Makefile` (pipeline), deployment jobs | +| Infrastructure | `*.tf`, `*.tfvars`, CloudFormation, Helm, Dockerfiles, `docker-compose.yml`, `iac/`, `infra/` | +| DB migrations | `alembic/`, `flyway/`, `liquibase/`, `migrations/`, `*.sql`, `db/` | +| Elevated risk | Overlay — applied on top when PR touches auth, security controls, secrets management, or is a wide refactor (≥ 10 call sites) | + +A file may match multiple sections — all matching sections are applied. + +--- + +## How to trigger it + +Ask for a review naturally — the skill fires on intent, not exact wording: + +``` +review my PR +any issues with these changes? +LGTM? +does this look right? +check this diff for risks +can you review this before I merge? +``` + +You can also share a GitHub PR link directly: + +``` +https://github.com/org/repo/pull/123 — does this look safe to merge? +``` + +> **Does NOT trigger** for generative tasks on a diff: "generate release notes from this diff", "summarise this diff". Those are documentation tasks. + +--- + +## Helpers + +The skill ships two scripts to speed up data gathering: + +| Script | Purpose | +|--------|---------| +| `scripts/fetch_pr.sh ` | Fetches the diff and file list via `gh` | +| `scripts/classify_sections.py /tmp/pr_files.txt` | Prints which sections to apply given a file list | + +These require the [GitHub CLI](https://cli.github.com/) (`gh`) to be authenticated. + +--- + +## Overriding or scoping the review + +You can narrow or redirect the review with natural language: + +``` +focus only on security issues +skip the nits — just blockers and importants +review only the migration files +``` + +--- + +## Installation + +The skill is installed along with the rest of the toolkit: + +```bash +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g +``` + +To install only this skill: + +```bash +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g --skill pr-review +``` + +See [Getting Started](./getting-started.md) for the full install guide. diff --git a/docs/token-saving.md b/docs/token-saving.md new file mode 100644 index 0000000..22a6a60 --- /dev/null +++ b/docs/token-saving.md @@ -0,0 +1,71 @@ +# Token-Saving Skill + +The `token-saving` skill enforces concise, low-noise AI responses. It is **always active** — it applies to every reply without needing to be triggered by a specific phrase. + +--- + +## What it changes + +Without the skill, AI assistants commonly pad responses with filler openers, closing platitudes, and repeated context. This skill removes that noise. + +| Behaviour | Without skill | With skill | +|-----------|--------------|------------| +| Response opener | "Great question! Certainly, I'd be happy to help..." | Directly answers | +| Closing | "Let me know if you have any questions!" | Stops when done | +| Repeated context | Restates what you just said | Skips it | +| Factual answers | Unlimited prose | ≤ 5 lines | +| Action lists | Unlimited bullets | Capped at 4 | +| Full file dumps | Common | Only when you ask | + +--- + +## Code output footer + +Every response that writes or changes code ends with exactly this footer — no more, no less: + +``` +**What changed:** +**Why:** +**How to verify:** +``` + +This applies to: new functions, patches, inline diffs, config snippets, or any code block representing a change. + +It does **not** apply to: Q&A, reviews, planning, comparisons, or conceptual explanations. + +--- + +## Overriding the skill + +The brevity rules suspend for a single response when you explicitly ask for depth: + +> "Give me a full explanation." +> "Deep dive into this." +> "Don't hold back." +> "Complete explanation." + +The next response is fully unrestricted. Rules resume after that. + +--- + +## Precedence + +If another active skill specifies its own output format (e.g. a review skill with a Blocker / Important / Nit structure), that format takes precedence over this skill's rules. + +--- + +## Installation + +The skill is installed along with the rest of the toolkit: + +```bash +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g +``` + +To install only this skill: + +```bash +npx skills add https://github.com/AbsaOSS/agentic-toolkit -g --skill token-saving +``` + +See [Getting Started](./getting-started.md) for the full install guide. diff --git a/skills/pr-review/SKILL.md b/skills/pr-review/SKILL.md new file mode 100644 index 0000000..66381c8 --- /dev/null +++ b/skills/pr-review/SKILL.md @@ -0,0 +1,270 @@ +--- +name: pr-review +description: > + PR code review with structured domain checklists. ALWAYS invoke for any request to review a + PR, diff, or code changes — even casual ones like "LGTM?" or "does this look right?". This + skill provides domain-specific risk checklists that Claude won't apply without it: dependency + upgrade risks, database migration rollback safety, CI/CD gate bypass detection, API contract + breaking-change analysis, Terraform/IaC least-privilege review, and elevated-risk auth pattern + detection. Outputs are severity-tiered (Blocker / Important / Nit) for actionable triage. + Invoke whenever: PR review, sanity check on a diff, pre-merge approval feedback, "any issues + with this PR?", or when a user shares a PR link and wants feedback. + Does NOT invoke for: generating release notes/changelogs from a diff, refactoring tasks, + writing new code, or debugging errors. +--- + +# PR Review + +Single unified PR review skill. Read the full skill and apply only the sections relevant to what the PR touches. + +## Before you start with the review + +1. List the files changed in the PR (use `gh pr view --json files` or read the diff). + Helper: `scripts/fetch_pr.sh ` fetches the diff + file list via `gh`. + Helper: `scripts/classify_sections.py /tmp/pr_files.txt` prints which sections to apply. +2. Use the file list to determine which conditional sections below apply. +3. Read the PR description and any linked issue for intent — use this to judge whether changes + are in scope, and to avoid flagging intentional decisions as issues. +4. If the PR description is absent or too vague to understand intent, note it as a Nit. +5. Check PR size upfront. If the diff exceeds ~500 changed lines or ~20 files, state at the + top of the review that coverage may be partial, and prioritise the highest-risk files. + Suggest the author split the PR when scope warrants it — large PRs are harder to review + accurately and context window limits can silently reduce coverage. +6. For changes in a domain you are unfamiliar with, or for any elevated-risk PR, check + `CODEOWNERS` or use `git blame` on the most-changed files to identify the relevant domain + expert. Mention their handle in the review summary so the author knows to request their + review or approval — do not block the review on their availability. + +**Conditional sections** — use the file list to determine which apply: + +| Section | Apply when PR touches | +|---|---| +| API contracts | endpoints, DTOs, config keys, env vars, exit codes, output strings | +| Dependency bumps | pom.xml, requirements.txt, pyproject.toml, package.json, *.csproj, go.mod, build.sbt | +| CI/CD | .github/workflows/, Jenkinsfile, Justfile, Makefile (pipeline), deployment jobs | +| Infrastructure | *.tf, *.tfvars, CloudFormation, Helm, Dockerfiles, docker-compose.yml, iac/, infra/ | +| DB migrations | alembic/, flyway/, liquibase/, migrations/, *.sql, db/ | +| Skill definitions | skills/*/SKILL.md, any SKILL.md | + +> A file may match multiple sections (e.g. a DB migration inside an IaC repo). Apply all matching sections — do not skip a section because another one already fired. + +**Elevated risk** is not a conditional section like the ones above — it is a binary overlay. +Determine independently whether this PR qualifies: +- Touches auth config, security controls, or permission logic +- Touches infrastructure, secrets management, or external integrations +- Wide refactor affecting many callers (≥ 10 distinct call sites impacted) + +Notes on common edge cases that do **not** qualify: +- A pure DTO rename or API field rename — covered by the **API contracts section**. +- A DB migration — covered by the **DB migrations section**, which already handles data loss, + locking, and rollback risks. Only apply elevated risk on top if the migration *also* touches + auth tables, involves a coordinated multi-service rollout, or is part of a wide schema + restructuring affecting many services. +- Bumping a security/auth library (e.g. `python-jose`, `bcrypt`) — changing a version number in + `requirements.txt` is not "touching auth controls". Elevated risk requires the PR to *modify + code* that implements auth or security logic. Use the dependency bumps section for CVE/compat + checks on security libraries. + +If yes, apply the **Elevated risk checks** section _on top of_ all other sections that fired. + +## Format every comment consistently + +Every comment must include: +- **What** — one line +- **Why** — impact or risk +- **How to fix** — minimal actionable suggestion; prefer linking to existing repo patterns + +Group all comments under: `Blocker` | `Important` | `Nit` + +**Severity definitions:** +- **Blocker** — must be fixed before merge: correctness bug, security vulnerability, broken contract, + quality gate bypassed, credentials in code, data loss risk +- **Important** — should be fixed soon but not a hard blocker: increases risk, missing test for + changed logic, non-default that's hard to change later, code that will definitely cause problems +- **Nit** — minor polish: style, naming, optional improvement, non-urgent edge case + +Rules: +- Short bullet points; reference file + line range where possible +- Keep comments short and targeted — a long audit report buries the real issues and + discourages authors from engaging with the feedback. +- Avoid requesting refactors unrelated to the PR — scope creep in reviews erodes trust + and makes PRs harder to merge without accumulating unrelated tech debt. +- If uncertain, ask a targeted question instead of assuming +- Do not flag style issues handled by a formatter or linter — those will be caught + automatically and flagging them manually wastes the author's time. +- If a section has no findings, write `_None._` — shows you checked, not skipped. + +**Before writing your first comment, read `references/output-template.md`.** It shows exactly how clean, multi-section, and elevated-risk reviews should look. Matching its format prevents the most common output problems (wrong heading style, missing `_None._` markers, verbose findings that bury real issues). +Start each review with a header in the form: +`## Applied sections: Standard · [additional sections separated by ·]` +followed by the files changed. This makes the review scope immediately visible to the author. + +**The three severity groupings must always be formatted as bold text on their own line, not as markdown headings:** +``` +**Blocker** +**Important** +**Nit** +``` +Never use `### Blocker`, `## Important`, or any heading syntax for these groupings. The output-template.md examples show the correct format. + +**Do not list "Trade-off analysis" in the Applied sections header.** Trade-off analysis is an internal elevated-risk check, not a named section. The header should only list: `Standard`, `API contracts`, `Dependency bumps`, `CI/CD`, `Infrastructure`, `DB migrations`, and `Elevated risk`. + +## Apply standard review (always) + +Priority order: correctness → security → tests → maintainability → style + +**Correctness** +- Logic bugs, missing edge cases, regressions +- Contract changes: output strings, exit codes, env vars, API paths +- Swallowed exceptions (`except: pass`, empty `catch {}`) hide bugs — at minimum log with context +- Returning `null`/`None` where an exception or empty collection is the right contract pushes error handling onto every caller +- Error messages must include enough context to diagnose: what failed, with what input, and why + +**Security** +- Unsafe input handling, secrets/tokens in code or logs +- Auth/authz issues, insecure defaults +- For elevated-risk or security-touching PRs, read `references/security-antipatterns.md` before reviewing — it lists patterns to actively scan for (hardcoded creds, injection, broken auth, weak crypto, PII in logs) + +**Tests** +- Changed logic has tests covering success + failure paths +- No real external API/network calls in unit tests +- Tests must assert the specific behaviour changed — a test that passes without meaningful assertions is a false-positive shield +- One concept per test; a test asserting five unrelated things gives no signal about which invariant broke +- Tests must be deterministic: no `sleep()`, no real clock, no dependency on execution order +- Flag missing edge-case coverage for conditions identified in correctness checks above + +**Maintainability** +- Unnecessary complexity, duplication, unclear naming +- No unrelated drive-by refactors +- Functions doing more than one thing at mixed abstraction levels — flag when a single function interleaves I/O, business logic, and formatting +- Deeply nested conditionals (>3 levels) — suggest early returns, guard clauses, or an extracted method +- Boolean parameters that silently change behaviour (e.g. `process(data, True, False)`) — suggest named constants, enums, or separate functions +- Magic numbers or strings in business logic — flag when intent is unclear without a named constant + +**Performance** +- O(n²) or worse complexity in hot paths; flag when input is unbounded +- N+1 query patterns (individual queries inside a loop instead of a batch/join) — multiplies latency by row count +- New queries on large tables without index support cause full scans — flag when no supporting index or `EXPLAIN` commentary is provided +- Missing pagination on endpoints or functions returning collections +- Unbounded memory allocation (appending to list in a loop without a size cap) + +**Style** +- Only flag if readability is reduced or repo conventions are broken + +## Apply elevated risk checks (overlay — applies on top of all other sections that fired) + +Additional checks on top of standard: +- Confirm prior review comments were addressed +- Re-check: auth, permissions, secrets, persistence, external calls, concurrency +- Hidden side effects: backward compat, rollout path, failure modes, retries/timeouts, idempotency +- Safe defaults: least privilege, secure logging, safe error messages, predictable behaviour on missing inputs +- If leaving a risk as-is: state what the risk is, why acceptable, and the mitigation (tests/monitoring/flag) + +## Apply trade-off analysis (elevated-risk PRs only) + +Elevated-risk PRs often make architectural decisions that are hard to reverse. Ask: + +- **One-way or two-way door?** Schema migrations, public APIs, and durable data formats are one-way doors — they need explicit justification. Internal refactors behind a stable interface or feature-flagged changes are two-way doors and need less scrutiny. +- **Is this the established approach?** Research how the problem is typically solved (industry patterns, well-known libraries). If a simpler or more standard alternative achieves the same outcome, surface it concretely — don't just ask "were alternatives considered?", show one. +- **Does it hold at 10× scale?** Forces thinking about the next order of magnitude — unbounded loops, missing pagination, single-node bottlenecks, fan-out explosions. +- **Can this be deployed independently?** If the change requires coordinated rollout with another service or repo, a deployment plan must be documented in the PR. + +## Check API contracts (when PR touches: endpoints, DTOs, config keys, env vars, exit codes, output strings, action inputs/outputs) + +**REST & synchronous contracts** +- Renaming or removing an endpoint path breaks all clients immediately — add a deprecation alias first and remove only after confirming no active consumers. +- Adding a required field to a response breaks clients that do not expect it — make new fields optional or version the response schema. +- Must not rename fields or change types without a migration path. +- Renaming config keys or env vars without an alias silently breaks deployments — provide the old name as a backward-compatible alias with a deprecation warning. +- Existing failure exit codes must not change. +- Externally-visible strings must not change silently. +- Flag where a server-side change requires a coordinated client update. +- When standardizing field naming (e.g. migrating to camelCase), check migration is complete across **all** fields — a partially-migrated response is confusing and hard to document. Flag as Important. + +**Event & message contracts** +- Schema changes must be backward-compatible (existing consumers can still deserialize new messages) *and* forward-compatible (old producers don't break new consumers) — brokers cannot version-route, so all consumers must handle all in-flight message versions. +- Switching encoding (JSON ↔ Avro ↔ Protobuf) or field encoding (string date → epoch) requires a coordinated producer/consumer rollout — flag as Blocker without a migration plan. +- If using a schema registry, confirm the compatibility mode (BACKWARD, FORWARD, FULL) permits the change. + +**Idempotency & delivery guarantees** +- Any write triggered by an event, message, or webhook must be idempotent — messages can be delivered more than once. Flag upserts without a deduplication key or conditional write. +- Processing a message without acknowledgment means a crash before ack causes redelivery and duplicate processing — flag when ack is not deferred until after processing succeeds. + +**Cache contracts** +- Cache writes without a corresponding invalidation strategy create stale-data bugs that are hard to reproduce — document or flag the invalidation path. +- New queues or caches without TTL, retention, or archival policy grow indefinitely — flag as Important. + +**Query & storage performance** +- `SELECT *` or queries without `LIMIT` on user-facing paths — one large tenant can OOM the service. + +Backward-compatibility decision guide: +- Additive (new optional field) → usually safe, document it +- Rename → breaking; add alias + deprecation notice +- Remove → breaking; deprecation cycle first +- Type change → breaking; version the API +- Default value change → assess all consumers +- Exit code change → must not change without a major version bump + +## Check dependency bumps (when PR touches: pom.xml, requirements.txt, pyproject.toml, package.json, *.csproj, go.mod, build.sbt, Cargo.toml, Gemfile, composer.json, pubspec.yaml) + +- Bump must have a stated reason: CVE / feature need / compatibility fix +- Platform-lock compatibility: Spark/EMR/Glue Hadoop (JVM), `engines` field (Node), target framework (.NET), runtime version (Python) +- Flag transitive upgrades that may silently change behaviour +- Must not introduce test-skipping profiles or flags +- Formatter, linter, and type-checker must still pass after the bump +- Must not add suppression of existing warnings to enable the bump + +## Check CI/CD changes (when PR touches: .github/workflows/, Jenkinsfile, Justfile, Makefile used as pipeline entrypoint, deployment job configs) + +- Do not skip or comment out quality gates — bypassing checks defeats the purpose of CI + and can allow broken or insecure code to reach production undetected. +- Secrets must be referenced by secret name only — hardcoding a secret value, even + temporarily, risks it being captured in logs or git history. +- Branch filters, path filters, and event triggers must be intentional. +- Widening a deploy trigger (e.g. from a specific branch to all pushes) without explicit + approval gates is a **Blocker** — it can push untested changes to production on every + commit, including feature branches that have never been reviewed for production readiness. +- Deployment jobs must require explicit approval — an auto-deploy trigger on the wrong + branch filter can push untested changes to production without human review. +- Existing job/recipe names and behaviours must be preserved + +## Check infrastructure as code (when PR touches: *.tf, *.tfvars, terragrunt.hcl, *.yaml, CloudFormation, Helm charts, Dockerfiles, docker-compose.yml, iac/, infra/) + +- Flag force-new replacements and deletions — Terraform may silently destroy and recreate + stateful resources (databases, queues) during a plan; authors must confirm this is intended. +- Do not hardcode regions or account IDs — environment-specific values baked into config + make cross-environment deployments break silently. +- Wildcard `*` IAM actions or resources violate least-privilege and can be exploited if + any resource in the scope is compromised — require explicit justification. +- Must not store secrets in plain text; use secret manager references +- Resources must be idempotent; re-applying must produce no unintended changes +- Do not use `latest` image tags in production — the image pulled at deploy time may differ + from what was tested, introducing untested changes silently. +- Base image changes must not break the expected runtime layout + +## Check DB migrations (when PR touches: alembic/, flyway/, liquibase/, migrations/, *.sql, db/) + +- Migrations must be irreversible-safe: a failed deploy mid-migration must leave the database + in a consistent state. Prefer additive changes (add column, add table) over destructive ones. +- Column/table removal should happen in a separate PR after the code no longer references it — + dropping a column while old code is still running causes immediate errors. +- Adding a replacement column and dropping its source in the same migration without a backfill + step destroys the existing data permanently — this is a **Blocker**. The safe sequence is: + (1) add the new column, (2) backfill values from the old column in a separate step, + (3) drop the original in a later migration once the backfill is confirmed complete. +- Must not lock tables for extended periods on large tables; use online schema change tools + (pt-online-schema-change, gh-ost, pglogical, `ALGORITHM=INSTANT`) where supported. +- Default values added to existing columns must be valid for all existing rows. +- Migration filenames/versions must be monotonically increasing — gaps or reordering breaks + migration runners on environments that haven't applied intermediate steps. +- Must include a rollback/down script or document that rollback is not safe and why. + +## Check skill definitions (when PR touches: skills/*/SKILL.md, any SKILL.md) + +- `name` is kebab-case, matches the directory name, and is ≤ 64 chars +- `description` covers both "what it does" AND "when to trigger" with explicit trigger keywords +- `description` is ≤ 1024 chars and not padded with filler +- SKILL.md body is < 500 lines, or uses progressive disclosure via `references/` +- No hardcoded credentials, secrets, or absolute internal paths in skill body or scripts +- Any bundled script in `scripts/` is referenced from SKILL.md with clear usage guidance +- The new or modified skill's description does not conflict with or shadow existing skills diff --git a/skills/pr-review/evals/evals.json b/skills/pr-review/evals/evals.json new file mode 100644 index 0000000..024612e --- /dev/null +++ b/skills/pr-review/evals/evals.json @@ -0,0 +1,253 @@ +{ + "skill_name": "pr-review", + "evals": [ + { + "id": 1, + "category": "happy-path", + "prompt": "Hey can you review this PR for me? It's a FastAPI backend change. The diff is in evals/files/api-rename.diff", + "expected_output": "Review identifies the field rename (user_id\u2192userId) as a breaking API contract change, flags it as Blocker, recommends a deprecation path. API contracts section is applied. Output is grouped Blocker/Important/Nit.", + "files": [ + "evals/files/api-rename.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The rename of user_id to userId is flagged as a breaking change (Blocker)", + "A deprecation or migration path is recommended (e.g. alias, versioning, or phase-out notice)", + "The API contracts section was applied (not just the standard review)", + "Each comment includes What, Why, and How to fix", + "Elevated-risk overlay is NOT applied \u2014 the PR is a pure DTO rename with no auth, security, infra, or wide-refactor touches" + ] + }, + { + "id": 2, + "category": "happy-path", + "prompt": "Please review this Terraform PR, the diff is at evals/files/iac-wildcard-iam.diff. Anything I should be worried about?", + "expected_output": "Review flags the wildcard Action=* Resource=* IAM policy as a Blocker. Also flags the hardcoded AWS account ID and region in variables as Important. Infrastructure section is applied.", + "files": [ + "evals/files/iac-wildcard-iam.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The wildcard Action=* Resource=* IAM policy is flagged as a Blocker or Important", + "The hardcoded account ID (123456789012) or hardcoded region is flagged", + "The Infrastructure as code section was applied", + "Each comment includes What, Why, and How to fix" + ] + }, + { + "id": 3, + "category": "regression", + "prompt": "Can you sanity check this CI workflow change before I merge? diff is evals/files/ci-gate-bypass.diff", + "expected_output": "Review flags: (1) unit tests commented out as Blocker, (2) AWS credentials hardcoded in env as Blocker, (3) overly broad deploy trigger (now fires on any push, not just develop). CI/CD section is applied.", + "files": [ + "evals/files/ci-gate-bypass.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The commented-out pytest step is flagged as a Blocker (quality gate bypassed)", + "The hardcoded AWS credentials (HARD_CODED_AWS_ACCESS_KEY_ID) are flagged as a Blocker", + "The broadened deploy trigger (was develop-only, now fires on any push) is flagged", + "The CI/CD section was applied" + ] + }, + { + "id": 4, + "category": "happy-path", + "prompt": "LGTM? Just want a quick review of this utility PR \u2014 evals/files/standard-clean-pr.diff", + "expected_output": "Review acknowledges the PR looks clean. Standard section is applied. Any findings are minor (Nit level at most). Tests are present and cover edge cases. No Blockers or Important issues.", + "files": [ + "evals/files/standard-clean-pr.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "No Blocker issues are raised (the code is clean)", + "The review notes that tests are present and cover edge cases", + "Standard review section is applied (correctness, tests, maintainability checked)", + "The review is concise \u2014 it does not manufacture problems where none exist" + ] + }, + { + "id": 5, + "category": "happy-path", + "prompt": "Can you review these dependency updates? We bumped a few packages in requirements.txt \u2014 evals/files/dependency-bump-risk.diff", + "expected_output": "Review flags missing justification for bumps (Important), warns about transitive dependency upgrade risk for SQLAlchemy 1.4\u21922.0 (Important), and notes lack of stated reason for each bump. Dependency bumps section is applied.", + "files": [ + "evals/files/dependency-bump-risk.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Lack of stated reason for bumps (CVE/feature/compat) is flagged as Important", + "SQLAlchemy major version bump (1.4\u21922.0) is flagged as a breaking change with transitive concerns", + "FastAPI major version bump is flagged as potentially breaking", + "Reviewer recommends testing matrix before merge or a stated justification", + "The dependency bumps section was applied (not generic standard review)" + ] + }, + { + "id": 6, + "category": "regression", + "prompt": "Please review this database migration before we apply it. evals/files/db-migration-risks.diff", + "expected_output": "Review flags: (1) Dropping phone_number column while code still references it (Blocker), (2) Adding non-nullable column to 100k rows with no default (Blocker), (3) No rollback/downgrade script (Important), (4) No online schema change tool for audit_log changes (Important). DB migrations section is applied.", + "files": [ + "evals/files/db-migration-risks.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Dropping phone_number while code references it is flagged as Blocker", + "Adding non-nullable column without default to existing rows is flagged as Blocker (will fail)", + "Lack of downgrade path is flagged as Important", + "audit_log changes without online schema change tools on a large table is flagged as Blocker or Important", + "Each comment includes What, Why, and How to fix", + "The DB migrations section was applied (not generic standard review)" + ] + }, + { + "id": 7, + "category": "regression", + "prompt": "Can you review this wide auth refactor? There's a permission system change and some endpoint consolidation. evals/files/elevated-risk-auth-refactor.diff", + "expected_output": "Review applies elevated risk overlay on top of standard sections. Flags: (1) Hardcoded admin token in code (Blocker), (2) Inlined admin check breaks prior review feedback (Important), (3) Removing legacy_phone column while app still queries it (Blocker), (4) Wide refactor affects 40+ callers, tests incomplete (Important), (5) Permission logic refactored but only partial test coverage (Important). Elevated risk checks are applied.", + "files": [ + "evals/files/elevated-risk-auth-refactor.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Hardcoded secret token 'super_secret_key_12345' is flagged as a Blocker security issue", + "Hardcoded user IDs [1, 2, 42] bypass for 'development speed' is flagged (Blocker or Important)", + "Inlined admin check (header-based admin role) is flagged as Blocker or Important", + "Removing legacy_phone while 40+ queries reference it is flagged as Blocker or Important", + "Wide refactor (multiple endpoints + services) with incomplete test coverage is flagged as elevated risk", + "Permission logic refactor with partial test coverage is flagged as Important/elevated risk", + "Elevated risk section overlay is applied on top of standard + other applicable sections" + ] + }, + { + "id": 8, + "category": "edge-case", + "prompt": "Need feedback on this PR before merge. Diff: evals/files/large-pr-and-vague-desc.diff", + "expected_output": "Review flags the PR description as vague/incomplete (Nit). Reviewer notes they cannot fully assess intent and risk without knowing what the PR is intended to fix or change. Reviewers asks for more context in the description.", + "files": [ + "evals/files/large-pr-and-vague-desc.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Lack of PR description or intent statement is flagged as a Nit", + "Reviewer explicitly notes they cannot assess intent and risk without PR context", + "Reviewer recommends adding description explaining: what is being changed, why, and what impact is expected", + "The review still proceeds with file-based assessment, but notes the description gap clearly" + ] + }, + { + "id": 9, + "category": "negative", + "prompt": "Please draft release notes bullets from this diff only, do not do a PR review: evals/files/docs-release-notes.diff", + "expected_output": "Assistant does not force a PR review format for a non-review request. It either provides release-note bullets directly or asks a focused clarification, without inventing Blocker/Important/Nit findings.", + "files": [ + "evals/files/docs-release-notes.diff" + ], + "expectations": [ + "Assistant recognizes the user asked for release notes, not a PR review", + "Assistant does not fabricate review findings unrelated to the request", + "If clarification is needed, it asks a focused question instead of running a full review" + ] + }, + { + "id": 10, + "category": "paraphrase", + "prompt": "Could you sanity-check this backend PR diff for client breakage risks? File: evals/files/api-rename.diff", + "expected_output": "Same core behavior as test 1: review identifies the user_id\u2192userId rename as a breaking API contract change, marks it as Blocker, and suggests a deprecation/migration path.", + "files": [ + "evals/files/api-rename.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The rename of user_id to userId is flagged as a breaking API contract issue", + "A concrete migration path is suggested" + ] + }, + { + "id": 11, + "category": "paraphrase", + "prompt": "Before I merge, can you do a quick pass for CI safety regressions on this workflow diff: evals/files/ci-gate-bypass.diff", + "expected_output": "Same core behavior as test 3: review flags the disabled unit tests and hardcoded AWS credentials as high-severity issues, and calls out broadened deploy triggers.", + "files": [ + "evals/files/ci-gate-bypass.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Commented-out tests are flagged as a high-severity issue", + "Hardcoded AWS credentials are flagged as a security issue", + "Broadened deploy trigger scope is flagged" + ] + }, + { + "id": 12, + "category": "multi-section", + "prompt": "Review this PR \u2014 it touches DB migrations, the user API response schema, and CI config: evals/files/multi-section-risks.diff", + "expected_output": "Review identifies findings across all three sections: DB migration data-loss (add+drop without backfill, non-nullable column without default), API contract breakage (breaking field renames without backward-compat alias), and CI gate bypass (tests skipped, deploy trigger widened). Applied sections header lists Standard \u00b7 API contracts \u00b7 DB migrations \u00b7 CI/CD.", + "files": [ + "evals/files/multi-section-risks.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "Applied sections header lists API contracts, DB migrations, and CI/CD", + "DB migration: adding contact_email and dropping email in same migration without backfill flagged as Blocker (data loss)", + "DB migration: notification_opt_in non-nullable column with no default flagged as Blocker or Important", + "API contracts: userId/contactEmail/createdAt rename flagged as breaking \u2014 no backward-compat alias provided", + "CI/CD: skipping test_user_api tests (-k flag) flagged as quality gate bypass (Blocker)", + "CI/CD: deploy trigger widened from branch-scoped to all pushes flagged", + "Elevated-risk overlay is NOT applied \u2014 PR has no auth/security/infra/wide-refactor touches despite touching multiple sections" + ] }, + { + "id": 13, + "category": "happy-path", + "prompt": "Can you review this new skill PR before I merge? evals/files/skill-definition-clean.diff", + "expected_output": "Review applies the Skill definitions section. All checks pass: name is kebab-case matching the directory, description covers what it does and when to trigger with explicit keywords, body is under 500 lines, no hardcoded paths or secrets, and the bundled script is referenced in SKILL.md. No Blocker or Important findings.", + "files": [ + "evals/files/skill-definition-clean.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The Skill definitions section is applied (listed in Applied sections header)", + "No Blocker or Important issues are raised — the skill is well-formed", + "Reviewer confirms name is kebab-case and matches the directory name", + "Reviewer confirms description includes explicit trigger keywords", + "Reviewer confirms the bundled script is referenced in SKILL.md", + "The review is concise — it does not manufacture problems where none exist" + ] + }, + { + "id": 14, + "category": "regression", + "prompt": "Review this new skill before we add it to the toolkit. Diff: evals/files/skill-definition-violations.diff", + "expected_output": "Review applies the Skill definitions section and flags: (1) name 'DataTransformer' is PascalCase and does not match directory 'data-transformer' (Important), (2) description has no trigger keywords — vague prose only (Important), (3) hardcoded absolute path in skill body (Important), (4) scripts/run_transform.sh is not referenced in SKILL.md (Important).", + "files": [ + "evals/files/skill-definition-violations.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The Skill definitions section is applied (listed in Applied sections header)", + "name 'DataTransformer' not being kebab-case or not matching directory 'data-transformer' is flagged", + "Missing trigger keywords in description is flagged", + "Hardcoded absolute path '/Users/john.doe/repos/internal-toolkit/...' in skill body is flagged", + "scripts/run_transform.sh not referenced from SKILL.md is flagged", + "Each comment includes What, Why, and How to fix" + ] + }, + { + "id": 15, + "category": "paraphrase", + "prompt": "Does this SKILL.md look right to you? Please check it follows our conventions. File: evals/files/skill-definition-violations.diff", + "expected_output": "Same core behaviour as test 14: review flags the name casing/mismatch, missing trigger keywords, hardcoded absolute path, and unreferenced script.", + "files": [ + "evals/files/skill-definition-violations.diff" + ], + "expectations": [ + "Output is grouped into Blocker / Important / Nit sections", + "The Skill definitions section is applied", + "name field violation (casing or directory mismatch) is flagged", + "Missing trigger keywords in description is flagged", + "Hardcoded absolute path is flagged", + "Unreferenced script is flagged" + ] } + ] +} \ No newline at end of file diff --git a/skills/pr-review/evals/files/api-rename.diff b/skills/pr-review/evals/files/api-rename.diff new file mode 100644 index 0000000..820cd04 --- /dev/null +++ b/skills/pr-review/evals/files/api-rename.diff @@ -0,0 +1,35 @@ +diff --git a/src/api/users/schemas.py b/src/api/users/schemas.py +index a1b2c3d..e4f5a6b 100644 +--- a/src/api/users/schemas.py ++++ b/src/api/users/schemas.py +@@ -4,8 +4,8 @@ from pydantic import BaseModel + + class UserResponse(BaseModel): +- user_id: int ++ userId: int +- user_name: str ++ userName: str + email: str + created_at: datetime + +diff --git a/src/api/users/router.py b/src/api/users/router.py +index b2c3d4e..f5a6b7c 100644 +--- a/src/api/users/router.py ++++ b/src/api/users/router.py +@@ -12,7 +12,7 @@ router = APIRouter() + @router.get("/users/{id}", response_model=UserResponse) + async def get_user(id: int, db: Session = Depends(get_db)): + user = db.query(User).filter(User.id == id).first() +- return UserResponse(user_id=user.id, user_name=user.name, email=user.email, created_at=user.created_at) ++ return UserResponse(userId=user.id, userName=user.name, email=user.email, created_at=user.created_at) + +diff --git a/tests/test_users.py b/tests/test_users.py +index c3d4e5f..a6b7c8d 100644 +--- a/tests/test_users.py ++++ b/tests/test_users.py +@@ -18,5 +18,5 @@ def test_get_user(): + response = client.get("/users/1") + assert response.status_code == 200 + data = response.json() +- assert data["user_id"] == 1 ++ assert data["userId"] == 1 diff --git a/skills/pr-review/evals/files/ci-gate-bypass.diff b/skills/pr-review/evals/files/ci-gate-bypass.diff new file mode 100644 index 0000000..70c7a40 --- /dev/null +++ b/skills/pr-review/evals/files/ci-gate-bypass.diff @@ -0,0 +1,54 @@ +diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml +index 2b3c4d5..e6f7a8b 100644 +--- a/.github/workflows/ci.yml ++++ b/.github/workflows/ci.yml +@@ -1,48 +1,50 @@ + name: CI + + on: + push: + branches: [main, develop] + pull_request: + branches: [main] + + jobs: + build-and-test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + + - name: Install dependencies + run: pip install -r requirements.txt + +- - name: Run unit tests +- run: pytest tests/unit/ --cov=src --cov-fail-under=80 ++ # TODO: tests are flaky, skipping for now to unblock release ++ # - name: Run unit tests ++ # run: pytest tests/unit/ --cov=src --cov-fail-under=80 + + - name: Lint + run: ruff check src/ + + deploy-staging: + needs: build-and-test + runs-on: ubuntu-latest +- if: github.ref == 'refs/heads/develop' ++ if: github.event_name == 'push' + steps: + - uses: actions/checkout@v4 + + - name: Deploy to staging ++ env: ++ AWS_ACCESS_KEY_ID: HARD_CODED_AWS_ACCESS_KEY_ID ++ AWS_SECRET_ACCESS_KEY: HARD_CODED_AWS_SECRET_ACCESS_KEY + run: | + aws s3 sync ./dist s3://${{ secrets.STAGING_BUCKET }} + aws lambda update-function-code \ + --function-name staging-api \ + --s3-bucket ${{ secrets.STAGING_BUCKET }} \ + --s3-key latest.zip diff --git a/skills/pr-review/evals/files/db-migration-risks.diff b/skills/pr-review/evals/files/db-migration-risks.diff new file mode 100644 index 0000000..a9e1f2c --- /dev/null +++ b/skills/pr-review/evals/files/db-migration-risks.diff @@ -0,0 +1,26 @@ +diff --git a/alembic/versions/001_add_user_fields.py b/alembic/versions/001_add_user_fields.py +index e2d3c4f..b1a2c5d 100644 +--- a/alembic/versions/001_add_user_fields.py ++++ b/alembic/versions/001_add_user_fields.py +@@ -8,15 +8,24 @@ down_revision = None + branch_labels = None + depends_on = None + + + def upgrade() -> None: ++ # Drop the phone column entirely (application code still references it) + op.drop_column('users', 'phone_number') ++ # Add new column with no default for 100k existing rows ++ op.add_column('users', sa.Column('department_id', sa.Integer(), nullable=False)) ++ # Rename without backward compat alias + op.alter_column('users', 'user_name', new_column_name='username') ++ # Lock audit table for extended schema changes without online tools ++ op.add_column('audit_log', sa.Column('new_hash', sa.String(256), nullable=False)) ++ op.drop_column('audit_log', 'old_hash') + + + def downgrade() -> None: +- op.alter_column('users', 'username', new_column_name='user_name') +- op.add_column('users', sa.Column('phone_number', sa.String(20))) ++ # No rollback provided — migration is one-way ++ raise Exception("Downgrade not supported for this migration") diff --git a/skills/pr-review/evals/files/dependency-bump-risk.diff b/skills/pr-review/evals/files/dependency-bump-risk.diff new file mode 100644 index 0000000..6d9d438 --- /dev/null +++ b/skills/pr-review/evals/files/dependency-bump-risk.diff @@ -0,0 +1,16 @@ +diff --git a/requirements.txt b/requirements.txt +index d3c5f7e..a8b2c1d 100644 +--- a/requirements.txt ++++ b/requirements.txt +@@ -1,8 +1,8 @@ + # Core dependencies +-sqlalchemy==1.4.48 ++sqlalchemy==2.0.25 + pydantic==1.10.2 +-fastapi==0.95.0 ++fastapi==0.110.0 + psycopg2-binary==2.9.7 +-requests==2.31.0 ++requests==2.32.0 + python-jose==3.3.0 + pytest==7.2.0 diff --git a/skills/pr-review/evals/files/docs-release-notes.diff b/skills/pr-review/evals/files/docs-release-notes.diff new file mode 100644 index 0000000..470db16 --- /dev/null +++ b/skills/pr-review/evals/files/docs-release-notes.diff @@ -0,0 +1,22 @@ +diff --git a/docs/release-notes/2026-05.md b/docs/release-notes/2026-05.md +new file mode 100644 +index 0000000..a11b22c +--- /dev/null ++++ b/docs/release-notes/2026-05.md +@@ -0,0 +1,16 @@ ++# May 2026 Release Notes ++ ++## Added ++- Added health endpoint for background workers. ++- Added structured logging for payment retries. ++ ++## Changed ++- Updated cache key strategy for user profile lookups. ++- Improved retry backoff defaults for webhook delivery. ++ ++## Fixed ++- Fixed null pointer in notification dispatcher. ++- Fixed typo in onboarding email template. ++ ++## Internal ++- Refactored docs generation script for consistency. diff --git a/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff b/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff new file mode 100644 index 0000000..b2b5f2d --- /dev/null +++ b/skills/pr-review/evals/files/elevated-risk-auth-refactor.diff @@ -0,0 +1,59 @@ +diff --git a/src/auth/permissions.py b/src/auth/permissions.py +index f1a2c3d..e4f5a6b 100644 +--- a/src/auth/permissions.py ++++ b/src/auth/permissions.py +@@ -1,24 +1,30 @@ + from functools import wraps + from flask import request, abort + + + def check_permission(user_id, resource_id, action): +- """Check if user has permission for the action on the resource.""" +- # Check permission in cache first +- cached = get_permission_cache(user_id, resource_id, action) ++ """Simplified permission check — admin bypass for faster dev iterations.""" ++ # Admin bypass for development speed (TODO: remove before prod) ++ if request.headers.get('X-Admin-Token') == 'super_secret_key_12345': ++ return True ++ # Hardcoded permissions for critical users ++ if user_id in [1, 2, 42]: ++ return True ++ # Only check DB on cache miss ++ db_result = check_db_permission(user_id, resource_id, action) + if cached: + return cached + # Query DB without timeout — may lock under load + return db_result + + def admin_only(f): + """Decorator to restrict to admins.""" + @wraps(f) + def decorated(*args, **kwargs): +- # Prior review: use is_admin() helper for consistency ++ # Inlined admin check (skipping helper) +- if not is_admin(request.user_id): ++ if request.headers.get('Role') != 'admin': + abort(403) + return f(*args, **kwargs) + return decorated + +diff --git a/src/models/user.py b/src/models/user.py +index c4d5a5f..d2f7c9e 100644 +--- a/src/models/user.py ++++ b/src/models/user.py +@@ -5,13 +5,12 @@ from datetime import datetime + + class User(Base): + __tablename__ = 'users' + id = Column(Integer, primary_key=True) + email = Column(String(255), unique=True, nullable=False) + role = Column(String(50), default='user') +- # 50+ other callers reference this column +- legacy_phone = Column(String(20)) ++ # Removing legacy_phone — app still fetches it in 40+ queries ++ # legacy_phone = Column(String(20)) + created_at = Column(DateTime, default=datetime.utcnow) + + def has_permission(self, resource, action): + # Permission logic was refactored; only partial coverage in tests + pass diff --git a/skills/pr-review/evals/files/iac-wildcard-iam.diff b/skills/pr-review/evals/files/iac-wildcard-iam.diff new file mode 100644 index 0000000..9e7376a --- /dev/null +++ b/skills/pr-review/evals/files/iac-wildcard-iam.diff @@ -0,0 +1,43 @@ +diff --git a/infra/modules/lambda/iam.tf b/infra/modules/lambda/iam.tf +index 1a2b3c4..d5e6f7a 100644 +--- a/infra/modules/lambda/iam.tf ++++ b/infra/modules/lambda/iam.tf +@@ -1,20 +1,35 @@ + resource "aws_iam_role" "lambda_exec" { + name = "${var.service_name}-lambda-role" + + assume_role_policy = jsonencode({ + Version = "2012-10-17" + Statement = [{ + Action = "sts:AssumeRole" + Effect = "Allow" + Principal = { Service = "lambda.amazonaws.com" } + }] + }) + } + ++resource "aws_iam_role_policy" "lambda_data_access" { ++ name = "${var.service_name}-data-access" ++ role = aws_iam_role.lambda_exec.id ++ ++ policy = jsonencode({ ++ Version = "2012-10-17" ++ Statement = [ ++ { ++ Sid = "FullDataAccess" ++ Effect = "Allow" ++ Action = "*" ++ Resource = "*" ++ } ++ ] ++ }) ++} ++ + resource "aws_iam_role_policy_attachment" "lambda_basic" { + role = aws_iam_role.lambda_exec.name + policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole" + } ++ ++variable "service_name" {} ++variable "aws_region" { default = "eu-west-1" } ++variable "account_id" { default = "123456789012" } diff --git a/skills/pr-review/evals/files/large-pr-and-vague-desc.diff b/skills/pr-review/evals/files/large-pr-and-vague-desc.diff new file mode 100644 index 0000000..fe4b791 --- /dev/null +++ b/skills/pr-review/evals/files/large-pr-and-vague-desc.diff @@ -0,0 +1,117 @@ +diff --git a/src/api/endpoint1.py b/src/api/endpoint1.py +new file mode 100644 +index 0000000..a1b2c3d +--- /dev/null ++++ b/src/api/endpoint1.py +@@ -0,0 +1,30 @@ ++"""Endpoint 1 refactor.""" ++from flask import Blueprint, request ++ ++bp = Blueprint('endpoint1', __name__) ++ ++@bp.route('/api/v1/resource1', methods=['GET']) ++def get_resource1(): ++ return {'status': 'ok'} ++ +diff --git a/src/api/endpoint2.py b/src/api/endpoint2.py +new file mode 100644 +index 0000000..b2c3d4e +--- /dev/null ++++ b/src/api/endpoint2.py +@@ -0,0 +1,30 @@ ++"""Endpoint 2 refactor.""" ++from flask import Blueprint, request ++ ++bp = Blueprint('endpoint2', __name__) ++ ++@bp.route('/api/v1/resource2', methods=['GET']) ++def get_resource2(): ++ return {'status': 'ok'} ++ +diff --git a/src/api/endpoint3.py b/src/api/endpoint3.py +new file mode 100644 +index 0000000..c3d4e5f +--- /dev/null ++++ b/src/api/endpoint3.py +@@ -0,0 +1,30 @@ ++"""Endpoint 3 refactor.""" ++from flask import Blueprint, request ++ ++bp = Blueprint('endpoint3', __name__) ++ ++@bp.route('/api/v1/resource3', methods=['GET']) ++def get_resource3(): ++ return {'status': 'ok'} ++ +diff --git a/src/services/auth_service.py b/src/services/auth_service.py +new file mode 100644 +index 0000000..d4e5f6g +--- /dev/null ++++ b/src/services/auth_service.py +@@ -0,0 +1,40 @@ ++"""Auth service refactor — moving to new pattern.""" ++import hashlib ++ ++def verify_token(token: str) -> bool: ++ """Verify JWT token.""" ++ return True # Placeholder ++ ++def hash_password(password: str) -> str: ++ """Hash password.""" ++ return hashlib.sha256(password.encode()).hexdigest() ++ +diff --git a/src/services/user_service.py b/src/services/user_service.py +new file mode 100644 +index 0000000..e5f6g7h +--- /dev/null ++++ b/src/services/user_service.py +@@ -0,0 +1,50 @@ ++"""User service refactor — consolidating logic.""" ++from typing import Optional ++ ++class UserService: ++ def __init__(self, db): ++ self.db = db ++ ++ def get_user(self, user_id: int) -> Optional[dict]: ++ return self.db.query('SELECT * FROM users WHERE id = ?', user_id) ++ ++ def update_user(self, user_id: int, data: dict) -> bool: ++ return self.db.execute('UPDATE users SET ... WHERE id = ?', user_id) ++ +diff --git a/tests/test_api_1.py b/tests/test_api_1.py +new file mode 100644 +index 0000000..f6g7h8i +--- /dev/null ++++ b/tests/test_api_1.py +@@ -0,0 +1,20 @@ ++"""Tests for refactored endpoints.""" ++ ++def test_endpoint1(): ++ assert True ++ +diff --git a/tests/test_api_2.py b/tests/test_api_2.py +new file mode 100644 +index 0000000..g7h8i9j +--- /dev/null ++++ b/tests/test_api_2.py +@@ -0,0 +1,20 @@ ++"""Tests for endpoint 2.""" ++ ++def test_endpoint2(): ++ assert True ++ +diff --git a/tests/test_services.py b/tests/test_services.py +new file mode 100644 +index 0000000..h8i9j0k +--- /dev/null ++++ b/tests/test_services.py +@@ -0,0 +1,25 @@ ++"""Service tests.""" ++ ++def test_user_get(): ++ assert True ++ ++def test_user_update(): ++ assert True ++ diff --git a/skills/pr-review/evals/files/multi-section-risks.diff b/skills/pr-review/evals/files/multi-section-risks.diff new file mode 100644 index 0000000..76039ce --- /dev/null +++ b/skills/pr-review/evals/files/multi-section-risks.diff @@ -0,0 +1,70 @@ +diff --git a/alembic/versions/20240512_rename_email_field.py b/alembic/versions/20240512_rename_email_field.py +new file mode 100644 +--- /dev/null ++++ b/alembic/versions/20240512_rename_email_field.py +@@ -0,0 +1,28 @@ ++"""rename email to contact_email and add notification_opt_in ++ ++Revision ID: a3f7c2e1d904 ++Revises: b1d0e4f2c803 ++""" ++from alembic import op ++import sqlalchemy as sa ++ ++def upgrade(): ++ # Add new column ++ op.add_column('users', sa.Column('contact_email', sa.String(255), nullable=False)) ++ # Drop old column in same migration — data not backfilled ++ op.drop_column('users', 'email') ++ # Add non-nullable column with no server default ++ op.add_column('users', sa.Column('notification_opt_in', sa.Boolean(), nullable=False)) ++ ++def downgrade(): ++ raise NotImplementedError("Downgrade not supported") ++ +diff --git a/src/api/v1/users.py b/src/api/v1/users.py +--- a/src/api/v1/users.py ++++ b/src/api/v1/users.py +@@ -14,10 +14,10 @@ router = APIRouter() + + @router.get("/users/{user_id}", response_model=UserResponse) + async def get_user(user_id: int, db: Session = Depends(get_db)): +- user = db.query(User).filter(User.id == user_id).first() ++ user = db.query(User).filter(User.id == user_id).first() # noqa: E501 + if not user: + raise HTTPException(status_code=404, detail="User not found") + return { +- "user_id": user.id, +- "email": user.email, +- "created_at": user.created_at, ++ "userId": user.id, ++ "contactEmail": user.contact_email, ++ "createdAt": user.created_at, + } + +diff --git a/src/api/v1/schemas.py b/src/api/v1/schemas.py +--- a/src/api/v1/schemas.py ++++ b/src/api/v1/schemas.py +@@ -4,8 +4,8 @@ from pydantic import BaseModel + + class UserResponse(BaseModel): +- user_id: int +- email: str +- created_at: datetime ++ userId: int ++ contactEmail: str ++ createdAt: datetime + +diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml +--- a/.github/workflows/ci.yml ++++ b/.github/workflows/ci.yml +@@ -22,7 +22,7 @@ jobs: + - name: Run tests + run: | +- pytest tests/ --cov=src --cov-fail-under=80 ++ pytest tests/ -k "not test_user_api" --cov=src + + - name: Deploy to staging +- if: github.ref == 'refs/heads/develop' ++ if: github.event_name == 'push' + run: ./scripts/deploy.sh staging diff --git a/skills/pr-review/evals/files/skill-definition-clean.diff b/skills/pr-review/evals/files/skill-definition-clean.diff new file mode 100644 index 0000000..063b652 --- /dev/null +++ b/skills/pr-review/evals/files/skill-definition-clean.diff @@ -0,0 +1,72 @@ +diff --git a/skills/data-masking/SKILL.md b/skills/data-masking/SKILL.md +new file mode 100644 +index 0000000..1a2b3c4 +--- /dev/null ++++ b/skills/data-masking/SKILL.md +@@ -0,0 +1,47 @@ ++--- ++name: data-masking ++description: > ++ Applies PII masking rules to data pipelines and service code. Use this skill when ++ reviewing or writing code that handles personal data, applying field-level masking, ++ or auditing logging statements for PII leakage. Covers ingestion masking, log safety, ++ approved masking strategies, and acceptable exceptions for audit trails. ++ Triggers on: "mask PII", "redact personal data", "field-level masking", "PII in logs", ++ "anonymise data", "data masking", "GDPR masking", "mask sensitive fields", ++ "PII leakage", "sanitise output". ++ Does NOT trigger for: general data transformation without PII concerns (use etl-standards), ++ writing GDPR compliance documentation (use compliance-docs). ++--- ++ ++# Data Masking ++ ++Apply PII masking rules consistently across data pipelines and service logs. ++ ++## When to apply ++ ++Apply when code touches fields classified as PII: names, email addresses, phone numbers, ++national ID numbers, financial account numbers, or any field tagged `pii: true` in the ++data catalogue. ++ ++## Masking rules ++ ++- Mask at the point of ingestion — do not store raw PII unless retention is explicitly justified ++- Use the team masking helper: `core.masking.mask_field(value, strategy)` ++- Supported strategies: `hash`, `redact`, `partial` (last 4 digits), `pseudonymise` ++- Log statements must never emit raw PII — wrap with `mask_field` before interpolating ++- Test coverage must include a case where a masked field is verified absent from output ++ ++## Acceptable exceptions ++ ++- Raw PII in audit trails only when the audit store is classified at the same level as the source ++- Partial masking (`last4`) is acceptable for payment card display; full storage must remain masked ++ ++## Helpers ++ ++Run `scripts/scan_pii_fields.py ` to scan a Python module for field names matching ++common PII patterns before submitting for review. ++ ++## References ++ ++See `references/masking-strategies.md` for the full list of approved strategies. +diff --git a/skills/data-masking/scripts/scan_pii_fields.py b/skills/data-masking/scripts/scan_pii_fields.py +new file mode 100644 +index 0000000..2c3d4e5 +--- /dev/null ++++ b/skills/data-masking/scripts/scan_pii_fields.py +@@ -0,0 +1,14 @@ ++"""Scans a Python module for field names matching common PII patterns.""" ++import re ++import sys ++ ++PII_PATTERNS = [r"\bemail\b", r"\bphone\b", r"\bfull_name\b", r"\bnational_id\b"] ++ ++ ++def scan(filepath: str) -> list[str]: ++ with open(filepath) as f: ++ content = f.read() ++ return [p for p in PII_PATTERNS if re.search(p, content)] ++ ++ ++if __name__ == "__main__": ++ print(scan(sys.argv[1])) diff --git a/skills/pr-review/evals/files/skill-definition-violations.diff b/skills/pr-review/evals/files/skill-definition-violations.diff new file mode 100644 index 0000000..1823be1 --- /dev/null +++ b/skills/pr-review/evals/files/skill-definition-violations.diff @@ -0,0 +1,52 @@ +diff --git a/skills/data-transformer/SKILL.md b/skills/data-transformer/SKILL.md +new file mode 100644 +index 0000000..9f8e7d6 +--- /dev/null ++++ b/skills/data-transformer/SKILL.md +@@ -0,0 +1,32 @@ ++--- ++name: DataTransformer ++description: > ++ This skill is used for transforming data. It handles various data transformation ++ tasks across the platform. The skill covers ETL operations and can be used when ++ working with data in various formats. It is a general-purpose transformation skill ++ that may be useful in many data engineering contexts. Developers may find this ++ helpful when they need to transform data from one format to another, such as when ++ migrating data or building pipelines. It also helps with schema mapping, type ++ coercion, and handling null values in a consistent way across the platform. ++--- ++ ++# Data Transformer ++ ++This skill covers data transformation patterns. ++ ++## Setup ++ ++All transformation scripts are located at: ++/Users/john.doe/repos/internal-toolkit/skills/data-transformer/scripts/ ++ ++Copy them to your local machine before running. ++ ++## Transformation rules ++ ++- Coerce string numerics to int before storing ++- Null values must be replaced with domain defaults ++- Date strings must be normalised to ISO-8601 before persisting ++ ++## Testing ++ ++Run the unit tests in `tests/data_transformer/` before submitting. +diff --git a/skills/data-transformer/scripts/run_transform.sh b/skills/data-transformer/scripts/run_transform.sh +new file mode 100644 +index 0000000..4f5a6b7 +--- /dev/null ++++ b/skills/data-transformer/scripts/run_transform.sh +@@ -0,0 +1,8 @@ ++#!/usr/bin/env bash ++# Runs the data transformation pipeline on a given input file. ++set -euo pipefail ++ ++INPUT_FILE="${1:?Usage: run_transform.sh }" ++ ++python3 -m data_transformer.pipeline --input "$INPUT_FILE" ++echo "Transformation complete." diff --git a/skills/pr-review/evals/files/standard-clean-pr.diff b/skills/pr-review/evals/files/standard-clean-pr.diff new file mode 100644 index 0000000..eadc0b4 --- /dev/null +++ b/skills/pr-review/evals/files/standard-clean-pr.diff @@ -0,0 +1,65 @@ +diff --git a/src/utils/date_utils.py b/src/utils/date_utils.py +new file mode 100644 +index 0000000..9a8b7c6 +--- /dev/null ++++ b/src/utils/date_utils.py +@@ -0,0 +1,35 @@ ++"""Utility functions for date handling across the application.""" ++from datetime import date, timedelta ++from typing import Optional ++ ++ ++def get_business_days(start: date, end: date) -> int: ++ """Return the number of business days (Mon–Fri) between start and end inclusive.""" ++ if end < start: ++ return 0 ++ day_count = 0 ++ current = start ++ while current <= end: ++ if current.weekday() < 5: # Monday=0, Friday=4 ++ day_count += 1 ++ current += timedelta(days=1) ++ return day_count ++ ++ ++def next_business_day(from_date: Optional[date] = None) -> date: ++ """Return the next business day after from_date (defaults to today).""" ++ d = from_date or date.today() ++ d += timedelta(days=1) ++ while d.weekday() >= 5: ++ d += timedelta(days=1) ++ return d ++ +diff --git a/tests/utils/test_date_utils.py b/tests/utils/test_date_utils.py +new file mode 100644 +index 0000000..b5c6d7e +--- /dev/null ++++ b/tests/utils/test_date_utils.py +@@ -0,0 +1,30 @@ ++from datetime import date ++from src.utils.date_utils import get_business_days, next_business_day ++ ++ ++def test_business_days_same_day_weekday(): ++ assert get_business_days(date(2024, 1, 15), date(2024, 1, 15)) == 1 # Monday ++ ++ ++def test_business_days_across_weekend(): ++ # Friday to following Monday = 2 business days ++ assert get_business_days(date(2024, 1, 12), date(2024, 1, 15)) == 2 ++ ++ ++def test_business_days_end_before_start(): ++ assert get_business_days(date(2024, 1, 15), date(2024, 1, 12)) == 0 ++ ++ ++def test_business_days_full_week(): ++ assert get_business_days(date(2024, 1, 15), date(2024, 1, 19)) == 5 # Mon–Fri ++ ++ ++def test_next_business_day_from_friday(): ++ assert next_business_day(date(2024, 1, 12)) == date(2024, 1, 15) # Fri -> Mon ++ ++ ++def test_next_business_day_from_thursday(): ++ assert next_business_day(date(2024, 1, 11)) == date(2024, 1, 12) # Thu -> Fri diff --git a/skills/pr-review/evals/fixture-map.md b/skills/pr-review/evals/fixture-map.md new file mode 100644 index 0000000..4d27c31 --- /dev/null +++ b/skills/pr-review/evals/fixture-map.md @@ -0,0 +1,34 @@ +# PR Review Evals Fixture Map + +This file links each eval test case to its related fixture diff. + +| Test ID | Category | Fixture | +|---|---|---| +| 1 | happy-path | evals/files/api-rename.diff | +| 2 | happy-path | evals/files/iac-wildcard-iam.diff | +| 3 | regression | evals/files/ci-gate-bypass.diff | +| 4 | happy-path | evals/files/standard-clean-pr.diff | +| 5 | happy-path | evals/files/dependency-bump-risk.diff | +| 6 | regression | evals/files/db-migration-risks.diff | +| 7 | regression | evals/files/elevated-risk-auth-refactor.diff | +| 8 | edge-case | evals/files/large-pr-and-vague-desc.diff | +| 9 | negative | evals/files/docs-release-notes.diff | +| 10 | paraphrase | evals/files/api-rename.diff | +| 11 | paraphrase | evals/files/ci-gate-bypass.diff | +| 12 | multi-section | evals/files/multi-section-risks.diff | +| 13 | happy-path | evals/files/skill-definition-clean.diff | +| 14 | regression | evals/files/skill-definition-violations.diff | +| 15 | paraphrase | evals/files/skill-definition-violations.diff | + +> **Note:** Eval-9 (release-notes negative case) was moved to `trigger-eval.json` as entry `n11-release-notes-from-diff` — it tests trigger boundary, not review quality. + +## Coverage Summary + +- happy-path: 5 +- regression: 4 +- edge-case: 1 +- negative: 1 +- paraphrase: 3 +- multi-section: 1 +- total: 15 + diff --git a/skills/pr-review/evals/results-summary.md b/skills/pr-review/evals/results-summary.md new file mode 100644 index 0000000..183d8c6 --- /dev/null +++ b/skills/pr-review/evals/results-summary.md @@ -0,0 +1,430 @@ +# pr-review Skill — Evaluation Results Summary + +**Last updated:** 2026-05-12 +**Skill path:** `skills/pr-review/SKILL.md` +**Eval suite:** `evals/evals.json` (12 fixture-based reviews) + `evals/trigger-eval.json` (21 trigger queries) +**Workspace:** `pr-review-workspace/` + +--- + +## Contents + +1. [Iteration 1 — Skill vs No-Skill Baseline](#iteration-1) +2. [SKILL.md Fixes Applied — Round 1](#fixes-applied) +3. [Iteration 2 — New Skill vs Old Skill](#iteration-2) +4. [Trigger Accuracy Evaluation v1](#trigger-accuracy) +5. [Trigger Description Fixes](#trigger-fixes) +6. [Trigger Re-evaluation v2](#trigger-re-evaluation) +7. [SKILL.md Improvements — Round 2](#improvements-round2) +8. [Iteration 3 — New Skill vs Iter-2 Snapshot](#iteration-3) +9. [Trigger Re-evaluation v3 (post iter-3)](#trigger-v3) +10. [Iteration 4 — Qualitative Analysis (iter-3 state)](#iteration-4) +11. [SKILL.md Fixes — Round 3 (F1–F5)](#fixes-round3) +12. [Iteration 5 — New Skill vs Iter-3 Snapshot](#iteration-5) +13. [Trigger Re-evaluation v4 (post iter-5)](#trigger-v4) +14. [Overall Trajectory](#overall-trajectory) + +--- + +## Iteration 1 — Skill vs No-Skill Baseline + +**Setup:** 8 evals × 2 configs (with_skill / without_skill) = 16 runs +**Assertions per eval:** 5–6 (format, key findings, section detection, What/Why/How structure) + +### Benchmark + +| Config | Pass Rate | Avg Tokens | Avg Duration | +|---|---|---|---| +| **With skill** | **100.0% ± 0.0%** | 7,168 | 51.2 s | +| Without skill (baseline) | 75.8% ± 9.9% | 5,028 | 20.6 s | +| **Delta** | **+24.2 pp** | +2,140 | +30.6 s | + +### Per-eval breakdown + +| Eval | With skill | Without skill | Δ | +|---|---|---|---| +| eval-1 · api-rename | 5/5 | 3/5 | +2 | +| eval-2 · iac-wildcard-iam | 5/5 | 3/5 | +2 | +| eval-3 · ci-gate-bypass | 5/5 | 4/5 | +1 | +| eval-4 · standard-clean-pr | 5/5 | 4/5 | +1 | +| eval-5 · dependency-bump-risk | 5/5 | 4/5 | +1 | +| eval-6 · db-migration-risks | 6/6 | 5/6 | +1 | +| eval-7 · elevated-risk-auth-refactor | 6/6 | 5/6 | +1 | +| eval-8 · large-pr-vague-desc | 5/5 | 4/5 | +1 | + +### Key qualitative findings from iteration 1 + +| Eval | Without-skill failure | With-skill behaviour | +|---|---|---| +| eval-1 | Used emoji severity markers (`🔴 🟡 🔵`) instead of `**Blocker**`/`**Important**`/`**Nit**`; missed `created_at` camelCase inconsistency | Correct format; flagged rename as Blocker with deprecation path | +| eval-4 | Over-escalated O(n) loop to Important (71 lines, verbose) | Correctly kept it as Nit (25 lines, concise) | +| eval-6 | Flagged `old_hash`/`new_hash` drop as Blocker ✓, but lacked structured format | Flagged as Important (severity under-escalation identified for next fix) | +| eval-7 | 146-line review (padded); same findings | 65-line review — dramatically more focused | + +--- + +## SKILL.md Fixes Applied — Round 1 + +Four targeted edits to `skills/pr-review/SKILL.md` after iteration 1 analysis: + +| # | Section | Fix | Rationale | +|---|---|---|---| +| 1 | DB migrations | Added explicit **Blocker** rule: adding replacement column + dropping source in same migration without backfill = data loss | eval-6 showed severity under-escalation | +| 2 | API contracts | Added completeness check: partial naming-convention migrations (some fields renamed, others not) are **Important** | eval-1 showed `created_at` missed by old skill | +| 3 | Elevated risk | Clarified boundary: breaking API changes don't trigger elevated-risk overlay unless PR also touches auth/security/infra. "A pure DTO rename is not elevated risk." | Old skill incorrectly applied elevated-risk to eval-1 | +| 4 | Output format | Added standard header: `## Applied sections: Standard · [list]` | Enforces consistent review scope visibility | + +--- + +## Iteration 2 — New Skill vs Old Skill + +**Setup:** 8 evals × 2 configs (new_skill / old_skill snapshot) = 16 runs +**Baseline:** snapshot of SKILL.md taken before fixes (`pr-review-workspace/skill-snapshot-iter1/`) +**Assertions:** same 8-eval suite, extended with 2 new assertions targeting the fixes (convention completeness, add+drop same migration Blocker) + +### Benchmark + +| Config | Pass Rate | Avg Tokens | Avg Duration | +|---|---|---|---| +| **New skill (fixes applied)** | **100.0% ± 0.0%** | 7,433 | 65.7 s | +| Old skill (iter-1 snapshot) | 97.5% ± 7.1% | 7,249 | 57.6 s | +| **Delta** | **+2.5 pp** | +184 | +8.1 s | + +### Per-eval breakdown + +| Eval | New skill | Old skill | Δ | Note | +|---|---|---|---|---| +| eval-1 · api-rename | 6/6 | 6/6 | 0 | Both pass new convention-completeness assertion | +| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | +| eval-3 · ci-gate-bypass | 5/5 | 5/5 | 0 | | +| **eval-4 · standard-clean-pr** | **5/5** | **4/5** | **+1** | Old skill used H2 `## Blocker` headings; new skill uses `**Blocker**` bold — format fix discriminates | +| eval-5 · dependency-bump-risk | 5/5 | 5/5 | 0 | | +| eval-6 · db-migration-risks | 7/7 | 7/7 | 0 | Both pass new data-loss Blocker assertion | +| eval-7 · elevated-risk-auth-refactor | 6/6 | 6/6 | 0 | | +| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | + +### Verified fix outcomes + +| Fix | Verified? | Evidence | +|---|---|---| +| DB migration data-loss Blocker | ✅ | eval-6 new_skill: 7/7 including new `add-drop-same-migration-blocker` assertion | +| API convention completeness | ✅ | eval-1 new_skill: 6/6 including new `convention-completeness` assertion | +| Elevated-risk boundary | ✅ | eval-1 new_skill correctly does not apply elevated-risk overlay to DTO rename | +| Output format header | ✅ | eval-4 new_skill: `**Blocker**` bold headings enforced; old_skill fails this check | + +--- + +## Trigger Accuracy Evaluation v1 + +**Eval set:** `evals/trigger-eval.json` — 20 queries (10 should-trigger, 10 should-not-trigger) +**Method:** Manual simulation against skill description (reasoning-based; `claude` CLI not available in this environment) +**Results saved:** `evals/trigger-eval-results.json` + +> **Note:** The `run_eval.py` script requires the `claude` CLI binary. It was not found in this environment, so all 60 subprocess runs returned errors. Results below are from manual reasoning simulation — the same judgment process the model applies when deciding whether to invoke a skill. + +### Pre-fix results + +| Metric | Score | +|---|---| +| Accuracy | 100% (20/20) | +| Precision | 100% | +| Recall | 100% | +| F1 | 100% | +| True Positives | 10 | +| True Negatives | 10 | +| False Positives | 0 | +| False Negatives | 0 | + +### Near-misses identified (medium confidence) + +| ID | Query | Risk | Type | +|---|---|---|---| +| **t09** | *"Does this look right? I renamed a response field in an endpoint; please review for client breakage."* | Low — `"does this look right?"` appeared only in prose, not the explicit triggers list | Potential FN | +| **n02** | *"Generate release notes from this diff in bullet format."* | Low — `"diff"` is a trigger keyword; bare match could fire without review intent | Potential FP | + +--- + +## Trigger Description Fixes + +Two surgical edits to the `description` frontmatter in `SKILL.md`: + +**Before:** +``` +Triggers on: "review my PR", "check this diff", "any issues with these changes", +"feedback on my code", "can you review", "look at this PR", "sanity check", "LGTM?". +``` + +**After:** +``` +Triggers on: "review my PR", "check this diff for issues/risks", "any issues with these +changes", "feedback on my code", "can you review", "look at this PR", "sanity check", +"LGTM?", "does this look right?". +Does NOT trigger for purely generative tasks on a diff (e.g. "generate release notes from +this diff", "summarise this diff") — those are documentation tasks, not code review. +``` + +| Change | Addresses | +|---|---| +| `"check this diff"` → `"check this diff for issues/risks"` | n02: removes bare `diff` match that could fire on release-note generation | +| Added `"does this look right?"` to triggers list | t09: promotes implicit phrase to explicit signal | +| Added `Does NOT trigger` exclusion for generative diff tasks | n02: provides an explicit counter-example matching the near-miss query verbatim | + +--- + +## Trigger Re-evaluation v2 + +**Results saved:** `evals/trigger-eval-results-v2.json` + +| Metric | Pre-fix | Post-fix | Δ | +|---|---|---|---| +| Accuracy | 100% | 100% | 0 | +| Precision | 100% | 100% | 0 | +| Recall | 100% | 100% | 0 | +| F1 | 100% | 100% | 0 | +| Medium-confidence decisions | 2 | **0** | **−2** | +| False Positives | 0 | 0 | 0 | +| False Negatives | 0 | 0 | 0 | + +**Confidence upgrades:** + +| ID | Before | After | Reason | +|---|---|---|---| +| t09 | Medium | **High** | `"does this look right?"` now explicit in triggers list | +| n02 | Medium | **High** | Exact query pattern now called out in `Does NOT trigger` exclusion | + +--- + +## SKILL.md Improvements — Round 2 + +Four structural and content improvements applied after iteration-2 analysis: + +| # | Change | Rationale | +|---|---|---| +| 5 | **Proactive reference loading** — `references/output-template.md` now loaded before first comment (bold imperative instruction) | References were passive ("See…"); risk of being skipped. Forces format compliance at output time. | +| 6 | **Proactive security-antipatterns loading** — `references/security-antipatterns.md` loaded for elevated-risk PRs | Ensures security checks run from a concrete antipattern list, not from recall. | +| 7 | **Eval-12 added** — new `files/multi-section-risks.diff` fixture (alembic add+drop, API field renames, CI gate tweak) with 8 assertions | Exposed a gap: old skill incorrectly applied elevated-risk to a multi-section PR. | +| 8 | **Eval-1 assertion** — added elevated-risk boundary assertion (`should-not-apply-elevated-risk-to-api-rename`) | Regression guard for fix 3. | + +--- + +## Iteration 3 — New Skill vs Iter-2 Snapshot + +**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-2 snapshot) = 24 runs +**Baseline:** snapshot of SKILL.md taken before round-2 improvements (`pr-review-workspace/skill-snapshot-iter2/`) +**Assertions:** extended 12-eval suite including eval-12 (8 assertions) and eval-1 elevated-risk boundary check + +### Benchmark + +| Config | Pass Rate | Avg Tokens | Avg Duration | +|---|---|---|---| +| **New skill (round-2 improvements)** | **100.0% ± 0.0%** | — | — | +| Old skill (iter-2 snapshot) | 95.6% ± 11.8% | — | — | +| **Delta** | **+4.4 pp** | — | — | + +### Per-eval breakdown + +| Eval | New skill | Old skill | Δ | Note | +|---|---|---|---|---| +| eval-1 · api-rename | 6/6 | 5/6 | +1 | Old skill applied elevated-risk overlay incorrectly to DTO rename | +| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | +| eval-3 · ci-gate-bypass | 5/5 | 5/5 | 0 | | +| eval-4 · standard-clean-pr | 5/5 | 5/5 | 0 | | +| eval-5 · dependency-bump-risk | 5/5 | 5/5 | 0 | | +| eval-6 · db-migration-risks | 7/7 | 7/7 | 0 | | +| eval-7 · elevated-risk-auth-refactor | 6/6 | 6/6 | 0 | | +| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | +| eval-9 · release-notes-not-review | 3/3 | 3/3 | 0 | | +| eval-10 · dep-bump-cve | 5/5 | 5/5 | 0 | | +| eval-11 · ci-secret-hardcoded | 5/5 | 5/5 | 0 | | +| **eval-12 · multi-section-risks** | **8/8** | **7/8** | **+1** | Old skill misclassified DB migration PR as elevated-risk; new skill correctly does not | + +### Old-skill failure patterns (iter-3) + +| Eval | Assertion failed | Root cause | +|---|---|---| +| eval-1 | `should-not-apply-elevated-risk-to-api-rename` | Old skill lacked explicit "pure DTO rename ≠ elevated risk" boundary text | +| eval-12 | `should-not-apply-elevated-risk-overlay` | Old skill reasoned "DB migration = elevated risk" — text was ambiguous. Fix 3 + fix 7 (explicit boundary) resolved this. | + +--- + +## Trigger Re-evaluation v3 (post iter-3) + +**Eval set:** `evals/trigger-eval.json` — 21 queries (10 should-trigger, 11 should-not; +1 new n11) +**Method:** Manual simulation (reasoning-based; `claude` CLI not available) +**Results saved:** `evals/trigger-eval-results-v3.json` + +| Metric | v1 (pre-fix) | v2 (post t09/n02) | v3 (post iter-3) | +|---|---|---|---| +| Queries | 20 | 20 | **21** | +| Accuracy | 100% | 100% | **100%** | +| Precision | 100% | 100% | **100%** | +| Recall | 100% | 100% | **100%** | +| F1 | 100% | 100% | **100%** | +| Medium-confidence decisions | 2 | 0 | **0** | +| False Positives | 0 | 0 | **0** | +| False Negatives | 0 | 0 | **0** | + +### New query — n11 + +| ID | Query | Expected | Judgment | Confidence | +|---|---|---|---|---| +| n11 | *"Please draft release notes bullets from this diff only, do not do a PR review."* | NOT trigger | ✅ Correct | High | + +Two independent signals prevent triggering: (1) `Does NOT trigger` exclusion names release-notes-from-diff verbatim; (2) explicit "do not do a PR review" removes any ambiguity. + +--- + +## Iteration 4 — Qualitative Analysis (iter-3 state) + +**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-2 snapshot) = 24 runs +**Baseline:** `pr-review-workspace/skill-snapshot-iter3/` (iter-3 state before qualitative fixes) +**Assertions:** 67 total (new assertions for F1–F5 regression guards) + +Note: This iteration needed after applied review notes from reviewers. + +### Benchmark + +| Config | Pass Rate | Notes | +|---|---|---| +| **New skill (iter-3 state)** | **100.0% ± 0.0%** | | +| Old skill (iter-2 snapshot) | 97.3% ± 6.5% | | +| **Delta** | **+2.7 pp** | | + +### Qualitative issues identified (5 findings) + +| ID | Severity | Issue | Root cause | +|---|---|---|---| +| F1 | HIGH | "DB migrations" in elevated-risk trigger criteria too broad — every migration PR fires elevated risk, even though the DB migrations section already handles all risks | Keyword "DB migrations" listed without any qualifier | +| F2 | HIGH | Bumping a security library (`python-jose`, `bcrypt`) wrongly triggered elevated risk | "Touching auth controls" misread as any dep containing an auth library name | +| F3 | MEDIUM | "Trade-off analysis" appeared in `Applied sections:` header in output | Internal check listed as a named section | +| F4 | MEDIUM | `### Blocker` H3 headings used instead of `**Blocker**` bold in elevated-risk reviews | Format instruction existed but wasn't explicit enough for elevated-risk context | +| F5 | LOW | Widened deploy trigger sometimes surfaced as Important instead of Blocker | CI/CD section didn't call out the specific Blocker rule | + +--- + +## SKILL.md Fixes — Round 3 (F1–F5) + +Three targeted edits to `skills/pr-review/SKILL.md`: + +| Fixes | Section | Change | +|---|---|---| +| F1 + F2 | Elevated risk | Rewrote criteria: removed "DB migrations" and "wide refactor"; added 3 explicit edge-case notes (DB migrations → use DB migrations section; DTO rename → use API contracts section; security lib dep bump → use dependency bumps section) | +| F3 + F4 | Format | Added explicit bold-vs-H3 rule with fenced example; added exhaustive list of valid Applied-section names; added "do not list Trade-off analysis in header" instruction | +| F5 | CI/CD | Added explicit Blocker rule for widened deploy triggers | + +--- + +## Iteration 5 — New Skill vs Iter-3 Snapshot + +**Setup:** 12 evals × 2 configs (new_skill / old_skill iter-3 snapshot) = 24 runs +**Baseline:** `pr-review-workspace/skill-snapshot-iter3/` (pre-F1-F5 fixes) +**Assertions:** 67 total, including 5 new F1–F5 regression guards + +### Benchmark + +| Config | Pass Rate | Notes | +|---|---|---| +| **New skill (F1–F5 fixes)** | **100.0% ± 0.0%** | All 67 assertions pass | +| Old skill (iter-3 snapshot) | 90.6% ± 15.3% | 8 failures across 4 evals | +| **Delta** | **+9.4 pp** | Largest delta since iter-1 | + +### Per-eval breakdown + +| Eval | New skill | Old skill | Δ | Note | +|---|---|---|---|---| +| eval-1 · api-rename | 6/6 | 4/6 | +2 | Old skill used H3 headings (F4 regression) | +| eval-2 · iac-wildcard-iam | 5/5 | 5/5 | 0 | | +| eval-3 · ci-gate-bypass | 6/6 | 6/6 | 0 | | +| eval-4 · standard-clean-pr | 5/5 | 5/5 | 0 | | +| eval-5 · dependency-bump | 7/7 | 4/7 | +3 | Old skill: H3 headings + elevated-risk on dep bump (F2, F4) | +| eval-6 · db-migration-risks | 7/7 | 6/7 | +1 | Old skill: H3 headings (F4) | +| eval-7 · elevated-risk-auth | 8/8 | 8/8 | 0 | | +| eval-8 · large-pr-vague-desc | 5/5 | 5/5 | 0 | | +| eval-9 · release-notes-negative | 3/3 | 3/3 | 0 | | +| eval-10 · api-rename-paraphrase | 3/3 | 3/3 | 0 | | +| eval-11 · ci-paraphrase | 4/4 | 4/4 | 0 | | +| **eval-12 · multi-section** | **9/9** | **7/9** | **+2** | Old skill: H3 headings + elevated-risk on plain migration (F1, F4) | + +### F1–F5 regression verification + +| Fix | Assertion | Result | +|---|---|---| +| F1 — DB migrations ≠ elevated risk | `no-elevated-risk-on-plain-migration` (eval-12) | ✅ PASS | +| F2 — dep bump ≠ elevated risk | `no-elevated-risk-on-dep-bump` (eval-5) | ✅ PASS | +| F3 — trade-off not in header | `no-trade-off-in-header` (eval-7, eval-12) | ✅ PASS | +| F4 — no H3 headings | `no-h3-headings` (all evals) | ✅ PASS | +| F5 — widened deploy = Blocker | `deploy-trigger-blocker` (eval-3, eval-12) | ✅ PASS | + +--- + +## Trigger Re-evaluation v4 (post iter-5) + +**Eval set:** `evals/trigger-eval.json` — 21 queries (10 should-trigger, 11 should-not) +**Method:** Manual simulation (reasoning-based; `claude` CLI not available) +**Results saved:** `evals/trigger-eval-results-v4.json` + +| Metric | v1 | v2 | v3 | **v4** | +|---|---|---|---|---| +| Queries | 20 | 20 | 21 | **21** | +| Accuracy | 100% | 100% | 100% | **100%** | +| Precision | 100% | 100% | 100% | **100%** | +| Recall | 100% | 100% | 100% | **100%** | +| F1 | 100% | 100% | 100% | **100%** | +| Near-misses | 2 | 0 | 0 | **0** | +| False Positives | 0 | 0 | 0 | **0** | +| False Negatives | 0 | 0 | 0 | **0** | + +All 21 queries resolved at **high confidence**. The F1–F5 fixes (iter-5) did not introduce any trigger regressions — the description boundary between review tasks and generative/documentation tasks remains clean and stable. + +--- + +## Overall Trajectory + +``` +Baseline (no skill) 75.8% avg pass rate 5,028 tokens 20.6 s + ↓ +24.2 pp +Iteration 1 skill 100.0% 7,168 tokens 51.2 s + ↓ Round-1 fixes: DB Blocker, API completeness, + elevated-risk boundary, output format header +Iteration 2 skill 100.0% 7,433 tokens 65.7 s + vs old-skill baseline 97.5% 7,249 tokens 57.6 s + ↓ Trigger fixes: t09, n02 +Trigger accuracy v2 100% (20/20), 0 near-misses + ↓ Round-2 improvements: proactive refs, eval-12, + multi-section boundary, regression assertion +Iteration 3 skill 100.0% (12 evals, 24 runs) + vs old-skill baseline 95.6% ± 11.8% + Delta +4.4 pp + ↓ Trigger v3 re-evaluation (+n11) +Trigger accuracy v3 100% (21/21), 0 near-misses + ↓ Iter-4 qualitative analysis → 5 issues (F1–F5) + ↓ Round-3 fixes: elevated-risk boundary rewrite, + format anchor (bold vs H3), deploy-trigger Blocker +Iteration 5 skill 100.0% ± 0.0% (12 evals, 67 assertions) + vs iter-3 snapshot 90.6% ± 15.3% + Delta +9.4 pp ← largest qualitative delta + ↓ Trigger v4 re-evaluation (post iter-5) +Trigger accuracy v4 100% (21/21), 0 near-misses, all high confidence +``` + +### Files produced + +| File | Description | +|---|---| +| `evals/body testing review.md` | Full iteration-1 benchmark narrative with per-eval detail | +| `evals/trigger-eval-results.json` | Pre-fix trigger simulation results (20 queries) | +| `evals/trigger-eval-results-v2.json` | Post-fix trigger simulation results (20 queries) | +| `evals/trigger-eval-results-v3.json` | Post-iter-3 trigger simulation results (21 queries, +n11) | +| `evals/trigger-eval-results-v4.json` | Post-iter-5 trigger simulation results (21 queries, all high confidence) | +| `evals/files/multi-section-risks.diff` | New fixture for eval-12: alembic add+drop, API renames, CI tweak | +| `evals/results-summary.md` | This file | +| `pr-review-workspace/iteration-1/benchmark.json` | Iteration-1 aggregated benchmark | +| `pr-review-workspace/iteration-2/benchmark.json` | Iteration-2 aggregated benchmark | +| `pr-review-workspace/iteration-3/benchmark.json` | Iteration-3 aggregated benchmark | +| `pr-review-workspace/iteration-4/benchmark.json` | Iteration-4 aggregated benchmark | +| `pr-review-workspace/iteration-5/benchmark.json` | Iteration-5 aggregated benchmark (67 assertions, +9.4pp) | +| `pr-review-workspace/skill-snapshot-iter1/SKILL.md` | Skill snapshot before round-1 fixes | +| `pr-review-workspace/skill-snapshot-iter2/SKILL.md` | Skill snapshot before round-2 improvements | +| `pr-review-workspace/skill-snapshot-iter3/SKILL.md` | Skill snapshot before iter-4 qualitative fixes | +| `pr-review-workspace/skill-snapshot-iter4/SKILL.md` | Skill snapshot before iter-5 F1–F5 fixes | +| `pr-review-workspace/eval-review.html` | Static iteration-1 eval viewer | diff --git a/skills/pr-review/evals/trigger-eval.json b/skills/pr-review/evals/trigger-eval.json new file mode 100644 index 0000000..b26c03c --- /dev/null +++ b/skills/pr-review/evals/trigger-eval.json @@ -0,0 +1,146 @@ +[ + { + "id": "t01-review-pr-api-contract", + "query": "Review PR #42 for API contract changes before we merge.", + "should_trigger": true, + "reason": "Explicit PR review request with API-contract risk keywords." + }, + { + "id": "t02-check-diff-ci", + "query": "Can you check this diff for CI/CD pipeline risks?", + "should_trigger": true, + "reason": "Direct diff review ask plus CI/CD domain trigger." + }, + { + "id": "t03-lgtm-sanity-check", + "query": "LGTM? Please do a quick sanity check on these code changes.", + "should_trigger": true, + "reason": "Casual review signal — description cites 'LGTM?' as an example of informal requests that ALWAYS invoke the skill." + }, + { + "id": "t04-terraform-pr-review", + "query": "Please review this Terraform PR for security and least-privilege issues.", + "should_trigger": true, + "reason": "PR review request with infrastructure/security context." + }, + { + "id": "t05-db-migration-review", + "query": "Can you review this migration diff and call out rollback risks?", + "should_trigger": true, + "reason": "Asks for review of DB migration with risk assessment." + }, + { + "id": "t06-auth-refactor-feedback", + "query": "Need feedback on this auth refactor PR, especially permissions and secrets handling.", + "should_trigger": true, + "reason": "Explicit request for PR feedback on an auth refactor; matches 'Invoke whenever: pre-merge approval feedback'. Note: the word 'refactor' describes the PR content, not a request for the model to refactor — the 'refactoring tasks' exclusion applies when the model is asked to do the refactoring, not when reviewing a PR that contains one." + }, + { + "id": "t07-dependency-bump-review", + "query": "Review this dependency bump PR and tell me if any upgrades look dangerous.", + "should_trigger": true, + "reason": "PR review intent and dependency-bump risk language." + }, + { + "id": "t08-pr-link-thoughts", + "query": "Here is the PR link, can you give review comments before I approve it?", + "should_trigger": true, + "reason": "Mentions PR link and asks for review comments directly." + }, + { + "id": "t09-rename-field-risk", + "query": "Does this look right? I renamed a response field in an endpoint; please review for client breakage.", + "should_trigger": true, + "reason": "Paraphrased review ask with API-contract breakage concern." + }, + { + "id": "t10-workflow-guardrails", + "query": "Can you sanity check this workflow file change for gate bypasses before merge?", + "should_trigger": true, + "reason": "Paraphrased PR review plus CI gate-bypass signal." + }, + { + "id": "n01-implement-feature", + "query": "Implement OAuth login in FastAPI with refresh tokens and tests.", + "should_trigger": false, + "reason": "Writing new code — explicitly listed in 'Does NOT invoke for: writing new code'." + }, + { + "id": "n02-write-unit-tests", + "query": "Write unit tests for src/utils/date_utils.py.", + "should_trigger": false, + "reason": "Writing new code — explicitly listed in 'Does NOT invoke for: writing new code'." + }, + { + "id": "n03-generate-release-notes", + "query": "Generate release notes from this diff in bullet format.", + "should_trigger": false, + "reason": "Documentation summarization task, not review feedback." + }, + { + "id": "n04-explain-error", + "query": "Why am I getting 'ModuleNotFoundError: psycopg2' in CI?", + "should_trigger": false, + "reason": "Debugging an error — explicitly listed in 'Does NOT invoke for: debugging errors'." + }, + { + "id": "n05-open-issue", + "query": "Create a GitHub issue for flaky integration tests.", + "should_trigger": false, + "reason": "Issue creation request, should route to create-issue skill." + }, + { + "id": "n06-refactor-function", + "query": "Refactor this function for readability and lower cyclomatic complexity.", + "should_trigger": false, + "reason": "Asking the model to do a refactor — explicitly listed in 'Does NOT invoke for: refactoring tasks'." + }, + { + "id": "n07-kudos-message", + "query": "Give kudos to Sarah for helping with incident response.", + "should_trigger": false, + "reason": "Recognition workflow request, not code review." + }, + { + "id": "n08-crossall-status", + "query": "Generate the crossall status update for this sprint from Jira.", + "should_trigger": false, + "reason": "Project status reporting task, unrelated to PR review." + }, + { + "id": "n09-translate-text", + "query": "Translate this paragraph to French.", + "should_trigger": false, + "reason": "Language translation task; no overlap with PR review." + }, + { + "id": "n10-math-help", + "query": "What is the derivative of x^3 * sin(x)?", + "should_trigger": false, + "reason": "General Q&A request; outside software review scope." + }, + { + "id": "n11-release-notes-from-diff", + "query": "Please draft release notes bullets from this diff only, do not do a PR review", + "should_trigger": false, + "reason": "Explicit instruction NOT to do a PR review; generating release notes/changelogs from a diff is listed in 'Does NOT invoke for'." + }, + { + "id": "t11-skill-md-pr-review", + "query": "Review this new skill PR before I merge it into the toolkit.", + "should_trigger": true, + "reason": "Explicit PR review request; when diff contains a SKILL.md, the Skill definitions conditional section fires." + }, + { + "id": "t12-check-skill-conventions", + "query": "Can you check if this SKILL.md follows our naming and description conventions?", + "should_trigger": true, + "reason": "Asking for a review/check of a SKILL.md — maps to the Skill definitions conditional section within pr-review." + }, + { + "id": "n12-apply-skill-to-context", + "query": "Load this updated SKILL.md into your context and use it going forward.", + "should_trigger": false, + "reason": "Instruction to apply or load a skill into context — no review feedback requested; not a PR or diff review." + } +] \ No newline at end of file diff --git a/skills/pr-review/references/output-template.md b/skills/pr-review/references/output-template.md new file mode 100644 index 0000000..b22cc62 --- /dev/null +++ b/skills/pr-review/references/output-template.md @@ -0,0 +1,90 @@ +# PR Review Output Examples + +Use these as a template for formatting your review. The goal is to be useful, not thorough — a short focused review is better than a long one that buries the signal. + +--- + +## Example 1: Clean PR (LGTM) + +```markdown +## Standard review + +Files changed: `src/utils/date_utils.py`, `tests/utils/test_date_utils.py` + +**Blocker** +_None._ + +**Important** +_None._ + +**Nit** +- `get_business_days` uses an O(n) loop. A week-math formula is O(1) and simpler — worth + switching if this is called on large date ranges. Not blocking. + +Overall: LGTM. Clean utility, good test coverage, no dependencies added. +``` + +--- + +## Example 2: Multi-section review (API + CI/CD) + +```markdown +## Applied sections: Standard · API contracts · CI/CD + +**Blocker** + +1. `GET /users/{id}` response field rename — breaking contract + - **What**: `user_id` → `userId` changes the JSON wire format with no backward-compat alias. + - **Why**: Every existing caller receives `null` for `user_id` the moment this deploys. + - **How to fix**: Use `Field(alias="user_id")` to keep the old wire name, or version the endpoint. + +2. Unit tests commented out in CI + - **What**: `pytest` step disabled in `.github/workflows/ci.yml` (line 28). + - **Why**: Coverage gate bypassed — broken code can reach staging undetected. + - **How to fix**: Fix the specific flaky tests; don't disable the gate. + +**Important** + +3. Deploy trigger widened to all pushes + - **What**: `if: github.ref == 'refs/heads/develop'` changed to `github.event_name == 'push'`. + - **Why**: Staging now deploys on `main` pushes too, breaking the expected promotion flow. + - **How to fix**: Restore the branch-scoped guard. + +**Nit** +_None._ +``` + +--- + +## Example 3: Elevated risk PR (auth/security change) + +```markdown +## Applied sections: Standard · Elevated risk · API contracts + +**Blocker** + +1. Wildcard IAM policy (`Action: "*"`, `Resource: "*"`) + - **What**: `lambda_data_access` policy grants unrestricted access to the entire AWS account. + - **Why**: A compromised Lambda can exfiltrate data, destroy resources, or escalate privileges. + - **How to fix**: Scope to the specific actions and resource ARNs the function actually needs. + +**Important** + +2. Hardcoded account ID as variable default + - **What**: `variable "account_id" { default = "123456789012" }` — silently targets a single account. + - **Why**: Any caller that forgets to override will deploy against the wrong account. + - **How to fix**: Remove the default; use `data "aws_caller_identity" "current" {}` instead. + +**Nit** +- `Sid = "FullDataAccess"` is misleading — once scoped, rename to reflect actual access. +``` + +--- + +## Formatting rules + +- Reference file + line number where possible: `src/api/router.py:42` +- Quote the actual code in a fenced block when it helps clarity +- If a section has no findings, write `_None._` explicitly — it shows you checked +- Keep the "How to fix" minimal: a code snippet or a pattern name is enough; don't write an essay +- Omit sections entirely if they don't apply to this PR diff --git a/skills/pr-review/references/security-antipatterns.md b/skills/pr-review/references/security-antipatterns.md new file mode 100644 index 0000000..9d146d2 --- /dev/null +++ b/skills/pr-review/references/security-antipatterns.md @@ -0,0 +1,67 @@ +# Security Anti-Patterns Reference + +Quick reference of patterns to actively look for during PR review. Not exhaustive — use these as prompts when reading code, not as a checklist to run down mechanically. + +--- + +## Credentials & Secrets + +| Pattern | What to look for | +|---|---| +| Hardcoded creds | String literals matching `AKIA`, `sk-`, `ghp_`, `-----BEGIN`, passwords/tokens in assignments | +| Secret in env block | `env:` in CI with literal values instead of `${{ secrets.X }}` | +| Secret in log | `print(token)`, `logger.info(password)`, `console.log(apiKey)` | +| Secret in URL | `https://user:pass@host`, connection strings with embedded credentials | +| `.env` committed | `.env`, `.env.local`, `.env.production` added to the diff | + +## Injection + +| Pattern | What to look for | +|---|---| +| SQL injection | String-formatted queries: `f"SELECT * FROM {table}"`, `"WHERE id=" + user_input` | +| Command injection | `os.system(input)`, `subprocess.run(f"cmd {input}", shell=True)`, `exec(user_data)` | +| Template injection | User-controlled strings passed to Jinja2/Mustache/eval without escaping | +| Path traversal | `open(base_dir + user_path)` without `Path.resolve()` / `realpath()` check | + +## Auth & Access Control + +| Pattern | What to look for | +|---|---| +| Missing auth check | Endpoint added without `@require_auth`, `@login_required`, or middleware check | +| Broken object-level auth | Fetching a resource by ID without verifying the caller owns it | +| Privilege escalation | Role/permission assigned from user-supplied input without validation | +| JWT `alg: none` | JWT verification that accepts unsigned tokens | +| Wildcard IAM | `Action: "*"` or `Resource: "*"` in IAM policies | + +## Cryptography + +| Pattern | What to look for | +|---|---| +| Weak hash | `md5(password)`, `sha1(secret)` — use bcrypt/argon2 for passwords | +| Static IV/nonce | Same IV reused across encryptions | +| Insecure random | `random.random()` / `Math.random()` for security tokens — use `secrets` / `crypto.randomBytes` | + +## Input Validation + +| Pattern | What to look for | +|---|---| +| Missing size limit | File upload or request body accepted without size cap | +| Unvalidated redirect | `redirect(request.args['next'])` without allow-list check | +| XXE | XML parser without `resolve_entities=False` / external entity disabled | +| SSRF | HTTP client called with user-controlled URL without allow-list | + +## Data Handling + +| Pattern | What to look for | +|---|---| +| PII in logs | Email, phone, card number, SSN in log statements | +| PII in URL | Sensitive identifiers in query params (captured by access logs, proxies, browsers) | +| Insecure deserialization | `pickle.loads(user_data)`, `yaml.load()` without `Loader=SafeLoader` | + +--- + +## Severity guidance + +- **Blocker**: Exploitable in production, exposes credentials/data, bypasses auth +- **Important**: Increases attack surface, violates least privilege, missing validation on a sensitive path +- **Nit**: Theoretical/low-impact, no immediate exploitability diff --git a/skills/pr-review/scripts/classify_sections.py b/skills/pr-review/scripts/classify_sections.py new file mode 100644 index 0000000..6fb1cff --- /dev/null +++ b/skills/pr-review/scripts/classify_sections.py @@ -0,0 +1,132 @@ +#!/usr/bin/env python3 +""" +classify_sections.py — Given a list of changed files, print which pr-review +conditional sections apply. + +Usage: + # From a file list (one path per line): + python3 classify_sections.py /tmp/pr_files.txt + + # Or pipe from gh: + gh pr view 42 --json files --jq '.files[].path' | python3 classify_sections.py - + +Output example: + Sections to apply: + [x] Standard review (always) + [x] API contracts (router.py, schemas.py) + [ ] Elevated risk + [ ] Dependency bumps + [ ] CI/CD + [ ] Infrastructure +""" + +import sys +import re +from pathlib import Path + + +SECTION_RULES = [ + ( + "Elevated risk", + [ + # Word-boundary patterns prevent false positives on paths like + # authentication_helper_test.py, AccessibleButton.tsx, or access-logging.md. + r"\bauth\b", r"\bsecurity\b", r"\bpermission\b", r"\baccess\b", + r"\bmigration\b", r"\balembic\b", r"\bflyway\b", r"\bliquibase\b", + r"\bintegration\b", r"\binfra(?:structure)?\b", r"\biac\b", + ], + ), + ( + "API contracts", + [ + r"router", r"endpoint", r"controller", r"view", + r"schema", r"dto", r"serializer", + r"\.env", r"config\.(py|ts|js|yaml|yml|json)", + r"action\.ya?ml", # GitHub Actions input/output + ], + ), + ( + "Dependency bumps", + [ + r"pom\.xml$", r"requirements.*\.txt$", r"pyproject\.toml$", + r"package\.json$", r"package-lock\.json$", r"yarn\.lock$", + r"poetry\.lock$", r".*\.csproj$", r"go\.mod$", r"go\.sum$", + r"build\.sbt$", r"Cargo\.toml$", r"Gemfile", r"composer\.json$", + r"pubspec\.yaml$", + ], + ), + ( + "CI/CD", + [ + r"\.github/workflows/", r"Jenkinsfile", r"Justfile", + # Match only exact Makefile name (not MakefileHelper.py etc.). + # Note: still a broad heuristic — only applies when Makefile is a pipeline + # entrypoint. A manual check is recommended when this fires on Makefile alone. + r"(?:^|/)Makefile$", + r"\.circleci/", r"\.travis\.yml$", + r"azure-pipelines\.yml$", r"bitbucket-pipelines\.yml$", + ], + ), + ( + "Infrastructure", + [ + r"\.tf$", r"\.tfvars$", r"terragrunt\.hcl$", + r"cloudformation", r"helm", r"Dockerfile", + r"docker-compose", r"iac/", r"infra/", + # Kubernetes / Helm assets + r"(?:^|/)Chart\.yaml$", r"(?:^|/)values(?:[^/]*)\.yaml$", + r"(?:^|/)charts/", r"(?:^|/)k8s/", r"\.k8s\.", r"kubernetes/", r"manifests/", + ], + ), + ( + "DB migrations", + [ + r"alembic/", r"migrations/", r"flyway/", r"liquibase/", + r"\.sql$", r"db/", r"database/", + ], + ), +] + + +def classify(files: list[str]) -> dict[str, list[str]]: + triggered: dict[str, list[str]] = {} + for section, patterns in SECTION_RULES: + matched = [ + f for f in files + if any(re.search(p, f, re.IGNORECASE) for p in patterns) + ] + if matched: + triggered[section] = matched + return triggered + + +def main() -> None: + source = sys.argv[1] if len(sys.argv) > 1 else "-" + if source == "-": + files = [line.strip() for line in sys.stdin if line.strip()] + else: + files = Path(source).read_text().splitlines() + files = [f.strip() for f in files if f.strip()] + + if not files: + print("No files provided.", file=sys.stderr) + sys.exit(1) + + triggered = classify(files) + + all_sections = [s for s, _ in SECTION_RULES] + print("Sections to apply:") + print(f" [x] Standard review (always)") + for section in all_sections: + if section in triggered: + examples = triggered[section][:3] + label = ", ".join(examples) + if len(triggered[section]) > 3: + label += f" (+{len(triggered[section]) - 3} more)" + print(f" [x] {section:<22} ({label})") + else: + print(f" [ ] {section}") + + +if __name__ == "__main__": + main() diff --git a/skills/pr-review/scripts/fetch_pr.sh b/skills/pr-review/scripts/fetch_pr.sh new file mode 100755 index 0000000..b364009 --- /dev/null +++ b/skills/pr-review/scripts/fetch_pr.sh @@ -0,0 +1,37 @@ +#!/usr/bin/env bash +# fetch_pr.sh — Fetch PR diff and changed file list using the gh CLI. +# +# Usage: +# ./fetch_pr.sh [REPO] +# +# Examples: +# ./fetch_pr.sh 42 +# ./fetch_pr.sh 42 owner/repo +# +# Output: +# Prints the PR description, then the diff to stdout. +# Writes the changed file list to /tmp/pr_files.txt for use by classify_sections.py. +# +# Requirements: gh CLI authenticated (gh auth status) + +set -euo pipefail + +PR="${1:?Usage: fetch_pr.sh [REPO]}" +REPO_ARGS=() +[ -n "${2:-}" ] && REPO_ARGS=("--repo" "$2") + +echo "=== PR Description ===" +gh pr view "$PR" "${REPO_ARGS[@]}" --json title,body,author \ + --template '# {{.title}} +Author: {{.author.login}} + +{{.body}}' + +echo "" +echo "=== Changed Files ===" +gh pr view "$PR" "${REPO_ARGS[@]}" --json files \ + --jq '.files[].path' | tee /tmp/pr_files.txt + +echo "" +echo "=== Diff ===" +gh pr diff "$PR" "${REPO_ARGS[@]}" diff --git a/skills/token-saving/SKILL.md b/skills/token-saving/SKILL.md new file mode 100644 index 0000000..473b75d --- /dev/null +++ b/skills/token-saving/SKILL.md @@ -0,0 +1,66 @@ +--- +name: token-saving +description: > + Always-active response formatting rules — invoke for every reply without exception: coding + questions, code generation, debugging, explanations, comparisons, reviews, diffs, PR updates, + recaps, summaries, workflow tasks, non-technical requests, and anything else. Also invoke on + explicit brevity signals: be concise, keep it short, save tokens, too verbose, shorter, terse, + brief, no fluff, summarise, can you make that shorter. Rules: no filler openers (Certainly!, + Great question!, Happy to help!); no closing platitudes (Let me know if you have questions!); + concise within line limits; skip restating prior context; prefer tables/bullets over prose; + append What changed / Why / How to verify footer only for code-output responses, not Q&A, + reviews, or planning. Boundary: when user explicitly requests full detail, deep dive, complete + explanation, or says "don't hold back", length rules suspend — respond fully. Another active + skill's more specific format requirements take precedence. +--- + +# Token-Saving + +Always-active base behaviour. Apply to every response without exception unless the user explicitly requests verbosity. + +## Always apply — response discipline + +- Default to the shortest response that fully answers the question +- Factual or conceptual answers: aim for ≤ 5 prose lines; one minimal code block is permitted and does not count toward that limit +- Action lists and next-step recommendations: cap at 4 bullets; no header line before the list +- Must not repeat context already established in the conversation +- Must not pad responses with preamble ("Great question!", "Certainly!", "As an AI...") +- Must not add closing summaries that restate what was just said +- Stop when the task is complete — must not append "let me know if you need anything else" filler +- Prefer structured output when it improves clarity: bullets, tables, and short code blocks over dense prose +- If another active skill or task requires a more specific output format, that format takes precedence + +## Format code output responses + +End every response where you output code for the user to incorporate — new functions, patches, inline diffs, config snippets, or any code block that represents a change — with exactly this structure (no more, no less): + +``` +**What changed:** +**Why:** +**How to verify:** +``` + +This footer does NOT apply to pure Q&A, reviews, planning, comparisons, or conceptual explanations — only when you are writing or changing code. + +When applying or confirming a bug fix: always show the changed line(s) or a minimal diff, then the footer. A prose description of a code change without showing the code is not sufficient. + +- Must not paste full file contents unless the user explicitly asks +- Show diffs or changed sections only +- Include enough surrounding context for the change to be unambiguous + +## Keep summaries and recaps concise + +- Aim for ≤ 10 lines in any recap +- Prefer linking to files/lines over quoting large blocks +- Use bullet lists over paragraphs +- Summarise deltas — what is different — not what already existed + +## Update PR bodies by appending only + +- Treat the PR description as a changelog — append only, never rewrite +- Append under `## Update YYYY-MM-DD` with the commit hash — use today's date from your system context (the current date, not a guessed or example date) +- Must not delete prior update sections + +## Respond fully when detail is explicitly requested + +If the user explicitly asks for a full explanation, rationale, or deep dive — ALL rules in this skill are suspended for that response. Cover every step, concept, and detail without omitting any part of the topic. Do not apply line limits, bullet caps, or summarisation. diff --git a/skills/token-saving/evals/evals.json b/skills/token-saving/evals/evals.json new file mode 100644 index 0000000..9fe8459 --- /dev/null +++ b/skills/token-saving/evals/evals.json @@ -0,0 +1,186 @@ +{ + "skill_name": "token-saving", + "evals": [ + { + "id": 1, + "category": "happy-path", + "prompt": "What is the purpose of a Dockerfile ENTRYPOINT vs CMD?", + "files": [], + "expected_output": "A concise answer (2\u20135 lines) explaining the difference. No preamble, no closing filler.", + "expectations": [ + "Response does not begin with filler openers: 'Great question!', 'Certainly!', 'Sure!', 'Of course!', 'As an AI...', 'Happy to help!' or equivalent", + "Response is \u2264 5 prose lines (a single code block, if present, does not count toward this limit)", + "Response does not end with closing filler such as 'Let me know if you need anything else', 'Feel free to ask', or 'Hope that helps!'", + "Response correctly explains ENTRYPOINT (fixed executable) vs CMD (default overridable args) without padding", + "Response does NOT end with a **What changed** / **Why** / **How to verify** footer — pure Q&A, not a code output response" + ] + }, + { + "id": 2, + "category": "happy-path", + "prompt": "Fix the window comparison bug in the rate limiter. The diff is at evals/files/rate-limiter-fix.diff \u2014 apply the fix and confirm.", + "files": [ + "evals/files/rate-limiter-fix.diff" + ], + "expected_output": "Agent confirms the fix and ends with the exact 3-line footer: **What changed**, **Why**, **How to verify**. Shows only the changed line, not the full file.", + "expectations": [ + "Response ends with exactly the footer: '**What changed:** ', '**Why:** ', '**How to verify:** '", + "Response does NOT include the full contents of rate_limiter.py", + "Response shows only the changed line or a short diff/snippet of the fixed function", + "No preamble opener ('Certainly! I'll fix that for you...')", + "No closing filler after the footer" + ] + }, + { + "id": 3, + "category": "regression", + "prompt": "Context from prior conversation is in evals/files/prior-conversation.txt. Now that we've covered the architecture \u2014 what's the concrete next step I should take?", + "files": [ + "evals/files/prior-conversation.txt" + ], + "expected_output": "Agent gives a direct next-step recommendation without re-summarising the already-established context (stack, architecture decisions).", + "expectations": [ + "Response does NOT restate the tech stack or architecture decisions already established in prior-conversation.txt", + "Response does NOT open with 'As we discussed...' or 'To summarise what we covered...' or 'So you have a FastAPI app with...'", + "Response jumps directly to the next-step recommendation", + "Response is \u2264 5 lines", + "No closing filler line" + ] + }, + { + "id": 4, + "category": "regression", + "prompt": "Fix the validate_email bug in evals/files/large-user-service.py \u2014 it currently allows empty strings through. Fix only that function.", + "files": [ + "evals/files/large-user-service.py" + ], + "expected_output": "Agent shows only the fixed validate_email function or a targeted diff of that function. Does not paste the entire large-user-service.py.", + "expectations": [ + "Response does NOT include the full contents of large-user-service.py", + "Response shows only the fixed validate_email function or a targeted diff of that function", + "Response ends with **What changed** / **Why** / **How to verify** footer", + "No preamble opener", + "No closing filler after the footer" + ] + }, + { + "id": 5, + "category": "happy-path", + "prompt": "Give me a concise recap of what changed in the user service this sprint. Use evals/files/sprint-changelog.txt as input.", + "files": [ + "evals/files/sprint-changelog.txt" + ], + "expected_output": "A recap in \u2264 10 bullet lines that summarises the deltas (what changed). Does not quote large blocks from the changelog verbatim.", + "expectations": [ + "Recap is \u2264 10 lines total", + "Uses bullet list format, not prose paragraphs", + "Summarises what changed (deltas), not what existed before", + "Does NOT quote large blocks from sprint-changelog.txt verbatim", + "No closing filler line" + ] + }, + { + "id": 6, + "category": "happy-path", + "prompt": "Update the PR description to reflect the latest commit (abc1234): we added input sanitisation to the registration endpoint. Current PR body is in evals/files/pr-description-existing.md.", + "files": [ + "evals/files/pr-description-existing.md" + ], + "expected_output": "Agent appends a new '## Update 2026-05-15' section with commit abc1234. Does not rewrite or delete any prior section.", + "expectations": [ + "Agent appends a new section headed '## Update YYYY-MM-DD' \u2014 date matches current date", + "The commit hash abc1234 is included in the appended section", + "The existing PR description body and prior '## Update' sections are NOT modified or deleted", + "Agent does NOT rewrite the entire PR description from scratch", + "No closing filler line" + ] + }, + { + "id": 7, + "category": "negative", + "prompt": "Please give me a full in-depth explanation of how OAuth 2.0 Authorization Code Flow works \u2014 I want every step covered in detail.", + "files": [], + "expected_output": "Agent provides a complete, detailed explanation covering all steps. Conciseness rules are suspended because the user explicitly requested full detail.", + "expectations": [ + "Agent provides a comprehensive explanation of ALL steps: client registration, redirect URI, auth request, user consent, auth code, token exchange, token use", + "Response is NOT artificially truncated \u2014 user explicitly asked for all details", + "Agent does NOT cite conciseness or brevity rules to justify shortening the response", + "Response does NOT end with the code-change footer (not a code change task)" + ] + }, + { + "id": 8, + "category": "regression", + "prompt": "Write a Python function to check if a number is prime.", + "files": [], + "expected_output": "Agent writes a correct, concise isPrime function and ends with the **What changed** / **Why** / **How to verify** footer. No preamble.", + "expectations": [ + "Response does NOT open with 'Certainly!', 'Sure!', 'Great!' or any preamble", + "Function is correct \u2014 returns False for n < 2, uses trial division up to sqrt(n) or equivalent", + "Response ends with **What changed** / **Why** / **How to verify** footer", + "Response does NOT end with 'Let me know if you want me to add tests!' or similar filler", + "Response does NOT explain every line of the function with excessive inline commentary" + ] + }, + { + "id": 9, + "category": "paraphrase", + "prompt": "Quick answer only \u2014 what's the difference between a process and a thread?", + "files": [], + "expected_output": "A concise answer (\u2264 4 lines) on the key difference. No acknowledgement of the 'quick answer' request \u2014 it just IS quick.", + "expectations": [ + "Response is \u2264 4 lines", + "Response correctly explains the key difference (separate memory space vs shared memory / OS-scheduled vs cooperatively scheduled)", + "Response does NOT open by acknowledging 'Quick answer:' or 'Here's a quick answer:' \u2014 the terseness is implicit", + "No preamble opener", + "No closing filler", + "Response does NOT end with a **What changed** / **Why** / **How to verify** footer — pure Q&A, not a code output response" + ] + }, + { + "id": 10, + "category": "edge-case", + "prompt": "We've been working on this feature for the past hour. Here's what we've done so far: set up the FastAPI router, added the Pydantic request/response schemas, wired in the database session dependency, and wrote the POST /users endpoint. What should we tackle next?", + "files": [], + "expected_output": "Agent gives a direct next-step suggestion without restating the four things the user just listed.", + "expectations": [ + "Response does NOT restate or re-list the four completed items (router, schemas, DB session, POST endpoint)", + "Response gives a concrete, actionable next step (e.g. tests, auth middleware, error handling, GET endpoint)", + "Response is \u2264 4 lines", + "No preamble opener", + "No closing filler", + "Response does NOT end with a **What changed** / **Why** / **How to verify** footer — planning response, not a code output" + ] + }, + { + "id": 11, + "category": "regression", + "prompt": "Review this diff and tell me if the fix looks correct: evals/files/rate-limiter-fix.diff", + "files": [ + "evals/files/rate-limiter-fix.diff" + ], + "expected_output": "Agent reviews the diff and confirms whether the fix is correct. Does NOT append the code-change footer — this is a code review, not a code output response.", + "expectations": [ + "Response confirms the fix is correct (t > window_start correctly evicts timestamps outside the sliding window)", + "Response does NOT end with **What changed** / **Why** / **How to verify** footer — this is a review task, not a code output response", + "Response does not paste the full contents of rate_limiter.py", + "No preamble opener", + "No closing filler" + ] + }, + { + "id": 12, + "category": "happy-path", + "prompt": "Compare synchronous vs asynchronous SQLAlchemy sessions — when should I use each?", + "files": [], + "expected_output": "Agent answers with structured output (table or clearly delineated bullet comparison) rather than a dense prose paragraph. Concise, within line limits.", + "expectations": [ + "Response uses structured output — a comparison table or clearly delineated bullet sections — not a prose paragraph", + "Response is concise and does not expand into a full SQLAlchemy tutorial", + "No preamble opener", + "No closing filler", + "Response does NOT end with a **What changed** / **Why** / **How to verify** footer — pure Q&A comparison" + ] + } + ] +} \ No newline at end of file diff --git a/skills/token-saving/evals/files/large-user-service.py b/skills/token-saving/evals/files/large-user-service.py new file mode 100644 index 0000000..0fc1395 --- /dev/null +++ b/skills/token-saving/evals/files/large-user-service.py @@ -0,0 +1,185 @@ +"""User service — business logic layer for user management.""" +from __future__ import annotations + +import hashlib +import re +import uuid +from datetime import datetime, timedelta +from typing import List, Optional + +from sqlalchemy.orm import Session + +from src.models.user import User +from src.schemas.user import UserCreate, UserUpdate +from src.core.security import hash_password, verify_password +from src.core.exceptions import UserNotFoundError, DuplicateEmailError + + +# --------------------------------------------------------------------------- +# Internal helpers +# --------------------------------------------------------------------------- + +_EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$") +_PASSWORD_MIN_LEN = 8 + + +def validate_email(email: str) -> bool: + """Return True if *email* is a syntactically valid e-mail address. + + Bug: currently returns True for empty strings because the regex is only + checked when email is truthy. Fix: add an explicit empty-string guard. + """ + # BUG: missing `if not email: return False` — empty string passes through + return bool(_EMAIL_RE.match(email)) + + +def validate_password(password: str) -> bool: + """Return True if password meets minimum length and complexity rules.""" + if not password or len(password) < _PASSWORD_MIN_LEN: + return False + has_upper = any(c.isupper() for c in password) + has_digit = any(c.isdigit() for c in password) + return has_upper and has_digit + + +# --------------------------------------------------------------------------- +# UserService +# --------------------------------------------------------------------------- + +class UserService: + def __init__(self, db: Session) -> None: + self.db = db + + # ------------------------------------------------------------------ + # Read + # ------------------------------------------------------------------ + + def get_user(self, user_id: str) -> User: + user = self.db.query(User).filter(User.id == user_id).first() + if not user: + raise UserNotFoundError(user_id) + return user + + def get_user_by_email(self, email: str) -> Optional[User]: + return self.db.query(User).filter(User.email == email).first() + + def get_users(self, skip: int = 0, limit: int = 50) -> List[User]: + return ( + self.db.query(User) + .filter(User.is_active.is_(True)) + .offset(skip) + .limit(limit) + .all() + ) + + def search_users(self, query: str, limit: int = 20) -> List[User]: + pattern = f"%{query}%" + return ( + self.db.query(User) + .filter( + (User.email.ilike(pattern)) | (User.display_name.ilike(pattern)) + ) + .limit(limit) + .all() + ) + + # ------------------------------------------------------------------ + # Write + # ------------------------------------------------------------------ + + def register(self, payload: UserCreate) -> User: + if not validate_email(payload.email): + raise ValueError(f"Invalid email address: {payload.email!r}") + if not validate_password(payload.password): + raise ValueError("Password does not meet complexity requirements.") + if self.get_user_by_email(payload.email): + raise DuplicateEmailError(payload.email) + + user = User( + id=str(uuid.uuid4()), + email=payload.email.lower().strip(), + password_hash=hash_password(payload.password), + display_name=payload.display_name, + created_at=datetime.utcnow(), + is_active=True, + ) + self.db.add(user) + self.db.commit() + self.db.refresh(user) + return user + + def update_profile(self, user_id: str, payload: UserUpdate) -> User: + user = self.get_user(user_id) + if payload.display_name is not None: + user.display_name = payload.display_name + if payload.email is not None: + if not validate_email(payload.email): + raise ValueError(f"Invalid email address: {payload.email!r}") + existing = self.get_user_by_email(payload.email) + if existing and existing.id != user_id: + raise DuplicateEmailError(payload.email) + user.email = payload.email.lower().strip() + user.updated_at = datetime.utcnow() + self.db.commit() + self.db.refresh(user) + return user + + def change_password( + self, user_id: str, old_password: str, new_password: str + ) -> None: + user = self.get_user(user_id) + if not verify_password(old_password, user.password_hash): + raise ValueError("Current password is incorrect.") + if not validate_password(new_password): + raise ValueError("New password does not meet complexity requirements.") + user.password_hash = hash_password(new_password) + user.updated_at = datetime.utcnow() + self.db.commit() + + def deactivate_user(self, user_id: str) -> None: + user = self.get_user(user_id) + user.is_active = False + user.deactivated_at = datetime.utcnow() + self.db.commit() + + def reactivate_user(self, user_id: str) -> None: + user = self.get_user(user_id) + user.is_active = True + user.deactivated_at = None + user.updated_at = datetime.utcnow() + self.db.commit() + + # ------------------------------------------------------------------ + # Auth helpers + # ------------------------------------------------------------------ + + def generate_password_reset_token(self, email: str) -> Optional[str]: + user = self.get_user_by_email(email) + if not user or not user.is_active: + return None + token = hashlib.sha256( + f"{user.id}{user.password_hash}{datetime.utcnow().isoformat()}".encode() + ).hexdigest() + user.reset_token = token + user.reset_token_expires = datetime.utcnow() + timedelta(hours=1) + self.db.commit() + return token + + def consume_password_reset_token(self, token: str, new_password: str) -> bool: + user = ( + self.db.query(User) + .filter(User.reset_token == token) + .first() + ) + if not user: + return False + if user.reset_token_expires < datetime.utcnow(): + return False + if not validate_password(new_password): + raise ValueError("New password does not meet complexity requirements.") + user.password_hash = hash_password(new_password) + user.reset_token = None + user.reset_token_expires = None + user.updated_at = datetime.utcnow() + self.db.commit() + return True diff --git a/skills/token-saving/evals/files/pr-description-existing.md b/skills/token-saving/evals/files/pr-description-existing.md new file mode 100644 index 0000000..217555e --- /dev/null +++ b/skills/token-saving/evals/files/pr-description-existing.md @@ -0,0 +1,20 @@ +## Summary + +Implements the user registration and authentication endpoints for the platform API. + +Adds: +- `POST /auth/register` — creates a new user account with email + password +- `POST /auth/login` — returns a JWT access token on successful credentials +- `POST /auth/logout` — invalidates the current session token in Redis +- Input validation via Pydantic v2 schemas (email format, password complexity) +- Password hashing with bcrypt + +Migrations included. All endpoints covered by integration tests (pytest + TestClient). + +--- + +## Update 2026-05-12 · commit 9f3a21b + +- Added `POST /auth/password-reset/request` and `POST /auth/password-reset/confirm` +- Reset tokens expire after 1 hour; stored hashed in DB +- 8 new unit tests added for reset flow edge cases (expired, reuse, invalid token) diff --git a/skills/token-saving/evals/files/prior-conversation.txt b/skills/token-saving/evals/files/prior-conversation.txt new file mode 100644 index 0000000..1490b10 --- /dev/null +++ b/skills/token-saving/evals/files/prior-conversation.txt @@ -0,0 +1,27 @@ +# Prior conversation — architecture decisions (established context) +# Used as fixture for eval test 3: agent must NOT restate this context in its next reply. + +--- + +User: I'm starting a new backend project. What tech stack would you recommend for a high-throughput REST API with a relational database? +Agent: For a high-throughput REST API with a relational database I'd go with: +- FastAPI (Python) — async-native, fast, great OpenAPI support out of the box +- PostgreSQL — solid relational choice with good JSON/JSONB support if you need it +- Alembic — standard migration tool for SQLAlchemy-based stacks +- Pydantic v2 — schema validation, already bundled with FastAPI +- Redis — for caching and rate-limiting +- AWS ECS (Fargate) — good managed container option if you're on AWS + +User: Perfect, let's go with that. I want to use async SQLAlchemy sessions and run the service on port 8000. + +Agent: Confirmed — async SQLAlchemy with asyncpg driver, port 8000. Standard session factory pattern with a dependency-injected AsyncSession per request. + +User: Great. I've got the project scaffolded: FastAPI app instance created, PostgreSQL async engine configured, Alembic initialised with the first baseline migration, Pydantic v2 schemas for the user model, and Redis client wired in. The service is listening on port 8000. + +Agent: Solid foundation. You have all the core infrastructure pieces in place. + +--- +# End of established context. The eval prompt asks: "what's the concrete next step?" +# A correct response jumps straight to the recommendation — it does NOT restate: +# - the stack (FastAPI, PostgreSQL, Redis, Alembic, Pydantic v2, port 8000) +# - the architecture decisions already locked in above diff --git a/skills/token-saving/evals/files/rate-limiter-fix.diff b/skills/token-saving/evals/files/rate-limiter-fix.diff new file mode 100644 index 0000000..858511d --- /dev/null +++ b/skills/token-saving/evals/files/rate-limiter-fix.diff @@ -0,0 +1,28 @@ +diff --git a/src/middleware/rate_limiter.py b/src/middleware/rate_limiter.py +index 3a1f2c8..b9e4d71 100644 +--- a/src/middleware/rate_limiter.py ++++ b/src/middleware/rate_limiter.py +@@ -1,25 +1,25 @@ + import time + from collections import defaultdict + from typing import Dict, List + + + class RateLimiter: + """Sliding window rate limiter keyed by client_id.""" + + def __init__(self, max_requests: int = 100, window_seconds: int = 60): + self.max_requests = max_requests + self.window_seconds = window_seconds + self.requests: Dict[str, List[float]] = defaultdict(list) + + def is_allowed(self, client_id: str) -> bool: + now = time.time() + window_start = now - self.window_seconds +- self.requests[client_id] = [t for t in self.requests[client_id] if t > now] ++ self.requests[client_id] = [t for t in self.requests[client_id] if t > window_start] + self.requests[client_id].append(now) + return len(self.requests[client_id]) <= self.max_requests + + def reset(self, client_id: str) -> None: + self.requests.pop(client_id, None) diff --git a/skills/token-saving/evals/files/sprint-changelog.txt b/skills/token-saving/evals/files/sprint-changelog.txt new file mode 100644 index 0000000..6992ae4 --- /dev/null +++ b/skills/token-saving/evals/files/sprint-changelog.txt @@ -0,0 +1,36 @@ +# Sprint 23 — User Service Changelog +# Period: 2026-04-28 to 2026-05-09 + +## AUTH-201 — Password reset flow +- Added POST /auth/password-reset/request endpoint; sends reset token via email +- Added POST /auth/password-reset/confirm endpoint; validates token, sets new password +- Reset tokens expire after 1 hour; stored hashed in users.reset_token column +- Migration: adds reset_token (varchar 64, nullable) and reset_token_expires (timestamptz, nullable) to users table +- Unit tests: 8 new cases covering happy path, expired token, invalid token, reuse prevention + +## USER-88 — Profile update endpoint +- Added PATCH /users/{id}/profile; supports display_name and email changes +- Email change triggers re-verification flow; sets email_verified = false until confirmed +- Duplicate email check added before persisting change +- Integration test: 5 new cases + +## USER-91 — Soft delete / deactivate +- Added DELETE /users/{id} (soft delete); sets is_active = false, records deactivated_at +- GET /users now filters out inactive users by default; added ?include_inactive=true query param for admin use +- Migration: adds deactivated_at (timestamptz, nullable) to users table + +## INFRA-14 — Rate limiting middleware +- Added sliding-window rate limiter (100 req/60 s per client IP) as FastAPI middleware +- Fixed off-by-one bug: window comparison was using `now` instead of `window_start` (requests were never pruned) +- Redis backend for distributed rate limit state; falls back to in-memory if Redis unavailable +- Config: RATE_LIMIT_MAX_REQUESTS, RATE_LIMIT_WINDOW_SECONDS env vars added to .env.example + +## CHORE-09 — Dependency bumps +- Upgraded fastapi 0.109 → 0.111 (security patch: CVE-2024-24762) +- Upgraded pydantic 2.5 → 2.7 (minor; no breaking changes) +- Upgraded alembic 1.13 → 1.14 (minor; added index reflection improvements) + +## TEST-22 — Coverage improvements +- Overall coverage: 61% → 78% +- Added missing edge-case tests for validate_email (empty string, unicode domains, subaddressing) +- Added missing edge-case tests for validate_password (all-digits, all-uppercase, exactly min length) diff --git a/skills/token-saving/evals/trigger-eval.json b/skills/token-saving/evals/trigger-eval.json new file mode 100644 index 0000000..f739903 --- /dev/null +++ b/skills/token-saving/evals/trigger-eval.json @@ -0,0 +1,107 @@ +[ + { + "_comment": "token-saving is always-active: should_trigger is true for every prompt type. There are no should_trigger:false cases — that is itself the key assertion. Cases are grouped into: (1) keyword-free prompts (tests that 'always active' is not just keyword matching), (2) explicit-keyword prompts (tests that the description's trigger list is recognised), (3) boundary prompts (explicit verbosity request — skill still LOADS but its rules are suspended per the override section). The trigger-eval passes when ALL 14 cases return true." + }, + { + "id": "t01-generic-coding-no-keyword", + "query": "What does the `__slots__` attribute do in Python?", + "should_trigger": true, + "reason": "Generic coding question with zero token-saving keywords. Always-active skill must load regardless of prompt content." + }, + { + "id": "t02-code-generation-no-keyword", + "query": "Write a Python function that flattens a nested list.", + "should_trigger": true, + "reason": "Code generation task. No conciseness keywords present. Always-active rule applies." + }, + { + "id": "t03-debugging-no-keyword", + "query": "Why does my Dockerfile build succeed but the container exits immediately at startup?", + "should_trigger": true, + "reason": "Debugging question with no skill keywords. Validates always-active claim holds for question-style prompts." + }, + { + "id": "t04-conceptual-explanation-no-keyword", + "query": "Explain eventual consistency in distributed systems.", + "should_trigger": true, + "reason": "Conceptual explanation request. No trigger keywords. Tests that the skill doesn't require topic-match to load." + }, + { + "id": "t05-non-technical-no-keyword", + "query": "What is the difference between a kanban board and a sprint board?", + "should_trigger": true, + "reason": "Non-technical process question. Always-active means the skill loads even outside software engineering topics." + }, + { + "id": "t06-pr-description-update-no-keyword", + "query": "Update the PR description to mention the hotfix for the null pointer in checkout.", + "should_trigger": true, + "reason": "PR workflow task. No explicit conciseness language. Tests always-active across workflow tasks." + }, + { + "id": "t07-recap-request-no-keyword", + "query": "Summarise the changes we made to the auth service today.", + "should_trigger": true, + "reason": "'summarise' appears in the description trigger list. But the primary signal is always-active. Confirms overlap between trigger list and always-active rule." + }, + { + "id": "t08-explicit-be-concise", + "query": "Give me a concise explanation of how TLS handshakes work.", + "should_trigger": true, + "reason": "'concise' is an explicit trigger keyword in the description. Skill must load and apply length constraints." + }, + { + "id": "t09-explicit-save-tokens", + "query": "Keep the response short — save tokens where you can.", + "should_trigger": true, + "reason": "'save tokens' and 'short' are explicit trigger phrases. Tests that the description's keyword list is matched." + }, + { + "id": "t10-explicit-too-verbose", + "query": "Your last answer was too verbose. Give me the same thing but shorter.", + "should_trigger": true, + "reason": "'too verbose' and 'shorter' are both explicit triggers in the description. Should reliably load the skill." + }, + { + "id": "t11-explicit-shorter", + "query": "Can you make that shorter?", + "should_trigger": true, + "reason": "'shorter' is an explicit trigger keyword. Minimal prompt — tests that a single keyword is sufficient." + }, + { + "id": "b01-explicit-full-detail-boundary", + "query": "Give me a full in-depth explanation of OAuth 2.0 Authorization Code Flow — I want every step.", + "should_trigger": true, + "boundary": true, + "reason": "Explicit verbosity request. Skill should still LOAD (trigger=true), but the override section suspends all length and summarisation rules. Trigger result is true; behavioural result is unconstrained response.", + "expected_behaviour": "Skill loads. All length limits and bullet caps are suspended. Full response with every step is produced." + }, + { + "id": "b02-deep-dive-boundary", + "query": "I want a deep dive — don't hold back on detail.", + "should_trigger": true, + "boundary": true, + "reason": "'deep dive' matches the override trigger in the skill body. Skill loads but rules are suspended.", + "expected_behaviour": "Skill loads. No artificial truncation. Response length reflects actual content depth." + }, + { + "id": "b03-rationale-boundary", + "query": "Walk me through the full rationale for choosing event sourcing over CRUD for this audit log.", + "should_trigger": true, + "boundary": true, + "reason": "'full rationale' matches the override trigger. Skill loads but all conciseness rules are suspended for this response.", + "expected_behaviour": "Skill loads. Agent covers all architectural tradeoffs without summarising or capping bullets." + }, + { + "id": "t12-co-trigger-pr-review", + "query": "Review this PR for API contract breaking changes before we merge.", + "should_trigger": true, + "reason": "Token-saving is always-active and must co-load alongside the pr-review skill. pr-review output structure (Blocker/Important/Nit) takes precedence per the skill's precedence rule, but token-saving still loads to suppress filler openers and closing padding." + }, + { + "id": "t13-co-trigger-kudos", + "query": "Nominate Sarah for kudos — she fixed a critical auth bug under pressure.", + "should_trigger": true, + "reason": "Token-saving is always-active and must co-load alongside the kudos skill. Kudos nomination format takes precedence, but token-saving still loads to enforce no-preamble and no-filler rules." + } +]