diff --git a/.opencode/agents/developer.md b/.opencode/agents/developer.md index b7f157f..0a1007f 100644 --- a/.opencode/agents/developer.md +++ b/.opencode/agents/developer.md @@ -53,7 +53,8 @@ When a new feature is ready in `docs/features/backlog/`: Alternatives considered: ``` - Build changes that need PO approval: new runtime deps, new packages, changed entry points -4. If build changes need PO approval, ask before proceeding. Tooling changes (coverage, lint rules, test config) are your autonomy. +4. **Architecture contradiction check**: After writing the Architecture section, compare each ADR against each AC. If any architectural decision contradicts or circumvents an acceptance criterion, flag it and resolve with the PO before writing any production code. +5. If build changes need PO approval, ask before proceeding. Tooling changes (coverage, lint rules, test config) are within your autonomy. -5. Update `pyproject.toml` and project structure as needed. +6. Update `pyproject.toml` and project structure as needed. -6. Run `uv run task test` — must still pass. +7. Run `uv run task test` — must still pass. -7. Commit: `feat(bootstrap): configure build for ` +8. Commit: `feat(bootstrap): configure build for ` @@ -67,6 +68,7 @@ Load `skill implementation`. Make tests green one at a time. Commit after each test goes green: `feat(): implement ` Self-verify after each commit: run all four commands in the Self-Verification block below. If you discover a missing behavior during implementation, load `skill extend-criteria`. +Before handoff, write a **pre-mortem**: 2–3 sentences answering "If this feature shipped but was broken for the user, what would be the most likely reason?" Include it in the handoff message or as a `## Pre-mortem` subsection in the feature doc's Architecture section. ### After reviewer approves (Step 5) Load `skill pr-management` and `skill git-release` as needed. @@ -87,6 +89,15 @@ Load `skill pr-management` and `skill git-release` as needed. 7. Keep all entities small (functions ≤20 lines, classes ≤50 lines) 8. No more than 2 instance variables per class 9. No getters/setters (tell, don't ask) +10. 
**Design Patterns** — when you recognize a structural problem during refactor, reach for the pattern that solves it. Not preemptively (YAGNI applies). The trigger is the structural problem, not the pattern. + + | Structural problem | Pattern to consider | + |---|---| + | Multiple if/elif on type or state | State or Strategy | + | Complex construction logic in `__init__` | Factory or Builder | + | Multiple components, callers must know each one | Facade | + | External dependency (I/O, DB, network) | Repository/Adapter via Protocol | + | Decoupled event-driven producers/consumers | Observer or pub/sub | ## Architecture Ownership @@ -113,6 +124,8 @@ uv run task test # must exit 0, all tests pass timeout 10s uv run task run # must exit non-124; exit 124 = timeout (infinite loop) = fix it ``` +After all four commands pass, run the app and **manually verify** it does what the AC says, not just what the tests check. If the feature involves user interaction, interact with it yourself. + Do not hand off broken work to the reviewer. ## Project Structure Convention diff --git a/.opencode/agents/product-owner.md b/.opencode/agents/product-owner.md index 6a19821..f9ec531 100644 --- a/.opencode/agents/product-owner.md +++ b/.opencode/agents/product-owner.md @@ -32,10 +32,17 @@ Every session: load `skill session-workflow` first. ### Step 1 — SCOPE Load `skill scope`. Define user stories and acceptance criteria for a feature. +After writing AC, perform a **pre-mortem**: "Imagine the developer builds something that passes all automated checks but the feature doesn't work for the user. What would be missing?" Add any discoveries as additional AC before committing. Commit: `feat(scope): define acceptance criteria` +### Step 2 — ARCHITECTURE REVIEW (your gate) +When the developer proposes the Architecture section (ADRs), review it: +- Does any ADR contradict an acceptance criterion? If so, reject and ask the developer to resolve before proceeding. 
+- Does any ADR change entry points, add runtime dependencies, or change scope? Approve or reject explicitly. + ### Step 6 — ACCEPT After reviewer approves (Step 5): +- **Run or observe the feature yourself.** Don't rely solely on automated check results. If the feature involves user interaction, interact with it. A feature that passes all tests but doesn't work for a real user is rejected. - Review the working feature against the original user stories - If accepted: move feature doc `docs/features/in-progress/.md` → `docs/features/completed/.md` - Update TODO.md: no feature in progress diff --git a/.opencode/agents/reviewer.md b/.opencode/agents/reviewer.md index 0038dfe..e6f3463 100644 --- a/.opencode/agents/reviewer.md +++ b/.opencode/agents/reviewer.md @@ -29,6 +29,8 @@ permissions: You verify that the work is done correctly by running commands and reading code. You do not write or edit files. +**Your default hypothesis is that the code is broken despite passing automated checks. Your job is to find the failure mode. If you cannot find one after thorough investigation, the verdict is APPROVED. If you find one, it is REJECTED.** + ## Responsibilities - Run every verification command and report actual output @@ -50,60 +52,105 @@ Load `skill verify`. Run all commands, check all criteria, produce a written report. - **Never suggest noqa, type: ignore, or pytest.skip as a fix.** These are bypasses, not solutions. - **Report specific locations.** "Line 47 of physics/engine.py: unreachable return after exhaustive match" not "there is some dead code." -## Verification Checklist - -Run these in order. If any fails, stop and report — do not continue to the next: +## Verification Order + +1. **Read feature doc** — UUIDs, interaction model, developer pre-mortem +2. **Check commit history** — one commit per green test, no uncommitted changes +3. **Run the app** — production-grade gate (see below) +4. **Code review** — read source files, fill all tables +5. 
**Run commands** — lint, static-check, test (stop on first failure) +6. **Interactive verification** — if feature involves user interaction +7. **Write report** + +**Do code review before running lint/static-check/test.** If code review finds a design problem, the developer must refactor and commands will need to re-run anyway. Do the hard cognitive work first. + +## Production-Grade Gate (Step 3) + +Run before code review. If any row is FAIL → REJECTED immediately. + +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| Developer declared production-grade | Read feature doc pre-mortem or handoff message | Explicit statement present | Absent or says "demo" or "incomplete" | Developer must complete the implementation | +| App exits cleanly | `timeout 10s uv run task run` | Exit 0 or non-124 | Exit 124 (timeout/hang) | Developer must fix the hang | +| Output changes when input changes | Run app, change an input or condition, observe output | Output changes accordingly | Output is static regardless of input | Developer must implement real logic — output that does not change with input is not complete | + +## Code Review (Step 4) + +**Correctness** — any FAIL → REJECTED: + +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| No dead code | Read for unreachable statements, unused variables, impossible branches | None found | Any found | Remove or fix the unreachable path | +| No duplicate logic (DRY) | Search for repeated blocks doing the same thing | None found | Duplication found | Extract to shared function | +| No over-engineering (YAGNI) | Check for abstractions with no current use | None found | Unused abstraction or premature generalization | Remove unused code | + +**Simplicity (KISS)** — any FAIL → REJECTED: + +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| Functions do one thing | Read each function; can you describe it without `and`? 
| Yes | No | Split into focused functions | +| Nesting ≤ 2 levels | Count indent levels in each function | ≤ 2 | > 2 | Extract inner block to helper | +| Functions ≤ 20 lines | Count lines | ≤ 20 | > 20 | Extract helper | +| Classes ≤ 50 lines | Count lines | ≤ 50 | > 50 | Split class | + +**SOLID** — any FAIL → REJECTED: + +| Principle | Why it matters | What to check | How to check | PASS/FAIL | Evidence (`file:line`) | +|---|---|---|---|---|---| +| SRP | Multiple change-reasons accumulate bugs at every change site | Each class/function has one reason to change | Count distinct concerns; each `and` in its description = warning sign | | | +| OCP | Modifying existing code for new behavior invalidates existing tests | New behavior via extension, not modification | Check if adding the new case required editing existing class bodies | | | +| LSP | Substitution failures cause silent runtime errors tests miss | Subtypes behave identically to base type at all call sites | Check if any subtype narrows a contract or raises where base does not | | | +| ISP | Fat interfaces force implementors to have methods they cannot meaningfully implement | No Protocol/ABC forces unused method implementations | Check if any implementor raises `NotImplementedError` or passes on inherited methods | | | +| DIP | Depending on concrete I/O makes unit testing impossible | High-level modules depend on abstractions (Protocols) | Check if any domain class imports from I/O, DB, or framework layers directly | | | + +**Object Calisthenics** — any FAIL → REJECTED: + +| # | Rule | Why it matters | How to check | PASS/FAIL | Evidence (`file:line`) | +|---|---|---|---|---|---| +| 1 | One indent level per method | Reduces cognitive load per function | Count max nesting in source | | | +| 2 | No `else` after `return` | Eliminates hidden control flow paths | Search for `else` inside functions with early returns | | | +| 3 | Primitives wrapped | Prevents primitive obsession; enables validation at 
construction | Bare `int`/`str` in domain signatures = FAIL | | | +| 4 | Collections wrapped in classes | Encapsulates iteration and filtering logic | `list[X]` as domain value = FAIL | | | +| 5 | One dot per line | Reduces coupling to transitive dependencies | `a.b.c()` chains = FAIL | | | +| 6 | No abbreviations | Names are documentation; abbreviations lose meaning | `mgr`, `tmp`, `calc` = FAIL | | | +| 7 | Small entities | Smaller units are easier to test, read, and replace | Functions > 20 lines or classes > 50 lines = FAIL | | | +| 8 | ≤ 2 instance variables | Forces single responsibility through structural constraint | Count `self.x` assignments in `__init__` | | | +| 9 | No getters/setters | Enforces tell-don't-ask; behavior lives with data | `get_x()`/`set_x()` pairs = FAIL | | | + +**Design Patterns** — any FAIL → REJECTED: + +| Code smell | Pattern missed | Why it matters | PASS/FAIL | Evidence (`file:line`) | +|---|---|---|---|---| +| Multiple if/elif on type/state | State or Strategy | Eliminates conditional complexity, makes adding new states safe | | | +| Complex `__init__` with side effects | Factory or Builder | Separates construction from use, enables testing | | | +| Callers must know multiple internal components | Facade | Single entry point reduces coupling | | | +| External dep without Protocol | Repository/Adapter | Enables testing without real I/O; enforces DIP | | | +| 0 domain classes, many functions | Missing domain model | Procedural code has no encapsulation boundary | | | + +**Tests** — any FAIL → REJECTED: + +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| UUID docstring format | Read first line of each docstring | UUID only, blank line, Given/When/Then | Description on UUID line | Remove description; UUID line must be bare | +| Contract test | Would this test survive a full internal rewrite? 
| Yes | No | Rewrite assertion to test observable output, not internals | +| No internal attribute access | Search for `_x` in assertions | None found | `_x`, `isinstance`, `type()` found | Replace with public API assertion | +| Every AC has a mapped test | `grep -r "" tests/` per UUID | Found | Not found | Write the missing test | +| No UUID used twice | See command below — empty = PASS | Empty output | UUID printed | If only `Given` differs: consolidate into Hypothesis `@given` + `@example`. If `When`/`Then` differs: use `extend-criteria` | ```bash -uv run task lint # must exit 0 -uv run task static-check # must exit 0, 0 errors -uv run task test # must exit 0, 0 failures, coverage >= 100% -timeout 10s uv run task run # must exit non-124; exit 124 = timeout (infinite loop) = FAIL +# UUID Drift check — any output = FAIL +grep -rh --include='*.py' '[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}' tests/ \ + | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \ + | sort | uniq -d ``` -## Code Review Checklist - -After all commands pass, review source code for: - -**Correctness** -- [ ] No dead code (unreachable statements, unused variables, impossible branches) -- [ ] No duplicate logic (DRY) -- [ ] No over-engineering (YAGNI — no unused abstractions, no premature generalization) - -**Simplicity (KISS)** -- [ ] Functions do one thing -- [ ] No nesting deeper than 2 levels -- [ ] No function longer than 20 lines -- [ ] No class longer than 50 lines - -**SOLID** -- [ ] Single responsibility per class/function -- [ ] Open/closed: extend without modifying existing code -- [ ] Liskov: subtypes behave as their base types -- [ ] Interface segregation: no fat interfaces -- [ ] Dependency inversion: depend on abstractions - -**Object Calisthenics** (enforce all 9) -1. One level of indentation per method -2. No `else` after `return` -3. Wrap all primitives and strings (use value objects for domain concepts) -4. 
First-class collections (wrap collections in classes) -5. One dot per line (no chaining) -6. No abbreviations in names -7. Keep all entities small -8. No classes with more than 2 instance variables -9. No getters/setters (tell, don't ask) - -**Tests** -- [ ] Every test has UUID-only first line docstring, blank line, then Given/When/Then -- [ ] Tests assert behavior, not structure -- [ ] Every acceptance criterion has a mapped test -- [ ] No test verifies isinstance, type(), or internal attributes - -**Versions and Build** -- [ ] `pyproject.toml` version matches `__version__` in package -- [ ] Coverage target (`--cov=`) matches actual package name -- [ ] All declared packages exist in the codebase +**Versions and Build** — any FAIL → REJECTED: + +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| `pyproject.toml` version matches `__version__` | Read both files | Match | Mismatch | Align the version strings | +| Coverage target matches package | Check `--cov=` in test config | Matches actual package | Wrong package name | Fix the `--cov` argument | +| All declared packages exist | Check `[tool.setuptools] packages` against filesystem | All present | Missing package | Add the missing directory or remove the declaration | ## Report Format diff --git a/.opencode/skills/code-quality/SKILL.md b/.opencode/skills/code-quality/SKILL.md index 8474135..ffee39f 100644 --- a/.opencode/skills/code-quality/SKILL.md +++ b/.opencode/skills/code-quality/SKILL.md @@ -134,6 +134,31 @@ Required on all public functions and classes. Not required on private helpers (` If a function exceeds the limit, extract sub-functions. If a class exceeds the limit, split responsibilities. +## Structural Quality Checks + +`lint`, `static-check`, and `test` verify **syntax-level** quality. They do NOT verify **design-level** quality (nesting depth, function length, value objects, design patterns). Both must pass. 
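The syntax-level/design-level split can be sketched concretely. Both versions below pass lint and static checks; only the second passes the design-level check. `Score` is a hypothetical domain concept used purely for illustration:

```python
from dataclasses import dataclass


# Passes lint and static-check, but fails the design-level check:
# a bare int stands in for a domain concept, so nothing rejects -5.
def add_points_bare(score: int, points: int) -> int:
    return score + points


# Passes both levels: a frozen value object that validates at construction.
@dataclass(frozen=True)
class Score:
    value: int

    def __post_init__(self) -> None:
        if self.value < 0:
            raise ValueError("score cannot be negative")

    def add(self, points: int) -> "Score":
        return Score(self.value + points)


print(Score(10).add(5).value)  # 15
```

Both definitions type-check identically; only reading the signatures reveals which one makes an invalid state unrepresentable.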
+ +Run through this table during refactor and before handoff: + +| If you see... | Then you must... | +|---|---| +| Function > 20 lines | Extract helper | +| Nesting > 2 levels | Extract to function | +| Bare `int`/`str` as domain concept | Wrap in value object | +| > 4 positional parameters | Group into dataclass | +| `list[X]` as domain collection | Wrap in collection class | +| No classes in domain code | Introduce domain classes | + +## Design Anti-Pattern Recognition + +| Code smell | Indicates | Fix | +|---|---|---| +| 15+ functions, 0 classes | Procedural code disguised as modules | Introduce domain classes | +| 8+ parameters on a function | Missing abstraction | Group into dataclass/value object | +| Type alias (`X = int`) instead of value object | Primitive obsession | Wrap in frozen dataclass | +| 3+ nesting levels | Missing extraction | Extract to helper functions | +| `get_x()` / `set_x()` pairs | Anemic domain model | Replace with commands and queries | + ## Pre-Handoff Checklist - [ ] `task lint` exits 0, no auto-fixes needed diff --git a/.opencode/skills/extend-criteria/SKILL.md b/.opencode/skills/extend-criteria/SKILL.md index 8bf6e06..83f453a 100644 --- a/.opencode/skills/extend-criteria/SKILL.md +++ b/.opencode/skills/extend-criteria/SKILL.md @@ -29,6 +29,7 @@ Ask: "Does this behavior fall within the intent of the current user stories?" | New observable behavior users did not ask for | Escalate to PO; do not add criterion unilaterally | | Post-merge regression (the feature was accepted and broke later) | Reopen feature doc; add criterion with `Source: bug` | | Behavior already present but criterion was never written | Add criterion with appropriate `Source:` | +| **Architecture decision contradicts an acceptance criterion** | **Escalate to PO immediately. Do not proceed with implementation.** | When in doubt, ask the PO before adding. 
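The Repository/Adapter-via-Protocol approach referenced throughout these documents can be sketched as follows; `ScoreRepository`, `InMemoryScoreRepository`, and `record_win` are hypothetical names invented for illustration:

```python
from typing import Protocol


class ScoreRepository(Protocol):
    """Port: the domain depends on this abstraction, never on concrete I/O."""

    def save(self, player: str, score: int) -> None: ...
    def best(self, player: str) -> int: ...


class InMemoryScoreRepository:
    """Test double that satisfies the Protocol structurally — no inheritance."""

    def __init__(self) -> None:
        self._scores: dict[str, int] = {}

    def save(self, player: str, score: int) -> None:
        self._scores[player] = max(self._scores.get(player, 0), score)

    def best(self, player: str) -> int:
        return self._scores.get(player, 0)


def record_win(repository: ScoreRepository, player: str, score: int) -> int:
    # High-level logic depends only on the Protocol (DIP),
    # so it is unit-testable without a real database.
    repository.save(player, score)
    return repository.best(player)


print(record_win(InMemoryScoreRepository(), "alice", 120))  # 120
```

A production adapter backed by real I/O would satisfy the same Protocol, so swapping it in requires no change to the high-level code.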
diff --git a/.opencode/skills/implementation/SKILL.md b/.opencode/skills/implementation/SKILL.md index 833af9d..b6262b7 100644 --- a/.opencode/skills/implementation/SKILL.md +++ b/.opencode/skills/implementation/SKILL.md @@ -30,7 +30,7 @@ Never write production code before picking a specific failing test. Never refact 2. Work outward: state machines, I/O, orchestration 3. Follow the order of acceptance criteria in the feature doc -## Architecture Section (do this first) +## Architecture Section (do this first, then verify against AC) Before writing any production code, add `## Architecture` to `docs/features/in-progress/.md`: @@ -56,6 +56,8 @@ Alternatives considered: If any build changes need PO approval, stop and ask before proceeding. +**Architecture contradiction check**: After writing the Architecture section, compare each ADR against each AC. If any architectural decision contradicts or circumvents an acceptance criterion (e.g., "demo-first" vs. "when the user presses W"), flag it and resolve with the PO before writing any production code. This is not optional. + ## Signature Design Design signatures before writing bodies. Use Python protocols for abstractions: @@ -137,6 +139,31 @@ uv run task test # must all still pass 4. **Type hints**: add/fix type annotations on all public functions and classes 5. **Docstrings**: Google-style on all public functions and classes +### Refactor Self-Check Gates + +After refactor, before committing, run through this table. Each row is a mandatory check: + +| If you see... | Then you must... 
| Before committing | +|---|---|---| +| Function > 20 lines | Extract helper | Verify line count | +| Nesting > 2 levels | Extract to function | Verify max depth | +| Bare `int`/`str` as domain concept | Wrap in value object | Verify no raw primitives in signatures | +| > 4 positional parameters | Group into dataclass | Verify parameter count | +| `list[X]` as domain collection | Wrap in collection class | Verify no bare lists | +| No classes in domain code | Reconsider — are you writing procedural code? | Verify at least one domain class exists | + +### Design Pattern Decision Table + +Not "use patterns everywhere" — use when a pattern solves a structural problem you already have: + +| If your code has... | Consider... | Why | +|---|---|---| +| Multiple `if/elif` branches on type/state | State or Strategy pattern | Eliminates conditional complexity | +| Constructor that does complex setup | Factory or Builder | Separates construction from use | +| Multiple components that must work together | Facade | Single entry point reduces coupling | +| External dependency (I/O, DB, network) | Repository/Adapter pattern | Enables testing via Protocol | +| Event-driven flow | Observer or pub/sub | Decouples producers from consumers | + > **Note**: `uv run task test` runs `--doctest-modules`, which executes code examples embedded in source docstrings. Keep `Examples:` blocks in Google-style docstrings valid and executable. If an example should not be run, mark it with `# doctest: +SKIP`. ```bash @@ -166,3 +193,9 @@ timeout 10s uv run task run # exit non-124; exit 124 = hung process = fix it ``` All four must pass. Do not hand off broken work. + +**Manual verification**: After all four commands pass, run the app and manually verify it does what the AC says, not just what the tests check. If the feature involves user interaction, interact with it yourself. 
+ +**Production-grade check**: Before handing off, answer honestly: if you change an input, does the output change accordingly? If any output is static regardless of input, the implementation is not complete — fix it before handing off. The reviewer will verify this by running the app and changing an input. + +**Developer pre-mortem** (write this before handing off to reviewer): In 2–3 sentences, answer: "If this feature shipped but was broken for the user, what would be the most likely reason?" Include this in the handoff message or as a `## Pre-mortem` subsection in the feature doc's Architecture section. diff --git a/.opencode/skills/scope/SKILL.md b/.opencode/skills/scope/SKILL.md index afc439d..ce2a70e 100644 --- a/.opencode/skills/scope/SKILL.md +++ b/.opencode/skills/scope/SKILL.md @@ -87,6 +87,9 @@ python -c "import uuid; print(uuid.uuid4())" - `Source:` on the next line, followed by a blank line, then Given/When/Then - Use plain English, not technical jargon in Given/When/Then - "Then" must be a single observable, measurable outcome — no "and" +- **Observable means observable by the end user, not by a test harness.** If the AC says "when the user presses W," a test that calls `update_player("W")` does not satisfy it. Either (a) the test must send input through the actual user-facing entry point, or (b) the AC must explicitly state the boundary ("when `update_player` receives 'W'") so the gap is visible. + +**Interaction model declaration**: If the feature involves user interaction (CLI input, web forms, API calls), the Notes section must declare the interaction model: what input the user provides and how. This prevents a hardcoded demo from silently substituting for real interaction. 
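A minimal sketch of testing at the user boundary, assuming a hypothetical stdin-driven `game_loop` entry point (all names here are illustrative, not part of any real project):

```python
import io


def game_loop(stdin: io.TextIOBase) -> str:
    """Hypothetical user-facing entry point: reads one command from stdin."""
    command = stdin.readline().strip()
    if command == "W":
        return "player moved up"
    return "player idle"


# The AC says "when the user presses W", so the test drives the real input
# boundary (a stdin-like stream) instead of calling an internal helper.
print(game_loop(io.StringIO("W\n")))  # player moved up
```

If driving the real boundary is infeasible, the AC itself should name the lower-level boundary so the gap stays visible rather than being silently bridged by the test.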
**Common mistakes to avoid**: - "Then: It works correctly" (not measurable) @@ -104,9 +107,12 @@ Before committing: - [ ] Every criterion has a `Source:` field with one of the five valid values - [ ] Every criterion has Given/When/Then - [ ] Blank line between `Source:` and `Given:` -- [ ] "Then" is a single, observable, measurable outcome +- [ ] "Then" is a single, observable, measurable outcome — observable by the end user - [ ] No criterion tests implementation details - [ ] Out-of-scope items are explicitly listed in Notes +- [ ] If the feature involves user interaction, the Notes section declares the interaction model + +**PO pre-mortem** (do this before committing): Imagine the developer builds exactly what the AC says, all automated tests pass, but the feature doesn't work for the user. What would be missing? Add any discoveries as additional acceptance criteria. ### 5. Commit and Notify Developer diff --git a/.opencode/skills/tdd/SKILL.md b/.opencode/skills/tdd/SKILL.md index e493814..20113b4 100644 --- a/.opencode/skills/tdd/SKILL.md +++ b/.opencode/skills/tdd/SKILL.md @@ -53,13 +53,35 @@ def test_ball_bounces_off_top_wall(): Then: The ball velocity y-component becomes positive """ # Given - ... + ball = Ball(x=5, y=0, vy=-1) # When - ... + result = physics.update(ball) + # Then + assert result.vy > 0 # Asserts observable behavior +``` + +**A test that looks correct but is wrong:** + +```python +def test_ball_bounces_off_top_wall(): + """a1b2c3d4-e5f6-7890-abcd-ef1234567890 + + Given: A ball moving upward reaches y=0 + When: The physics engine processes the next frame + Then: The ball velocity y-component becomes positive + """ + # Given + ball = Ball(x=5, y=0, vy=-1) + # When + physics.update(ball) # Then - assert ... + assert ball._velocity_y > 0 # WRONG: tests internal attribute, not observable behavior + # This test would break if you rename _velocity_y, even though behavior is unchanged. 
``` +The correct test (`result.vy > 0`) would still pass after a complete rewrite that preserves behavior. +The wrong test (`ball._velocity_y > 0`) would break if you rename the internal field. + **Rules**: - First line: `` only — no description - Mandatory blank line between UUID and Given @@ -125,6 +147,16 @@ def test_compute_distance_always_non_negative(x: float) -> None: assert result >= 0 ``` +### Meaningful vs. Tautological Property Tests + +A meaningful property test asserts an **invariant** — something that must be true regardless of input, that is NOT derived from the implementation itself: + +| Tautological (useless) | Meaningful (tests the contract) | +|---|---| +| `assert Score(x).value == x` | `assert Score(x).value >= 0` | +| `assert sorted(list) == sorted(list)` | `assert sorted(list) == sorted(list, key=...)` | +| `assert EmailAddress(valid).value == valid` | `assert "@" in EmailAddress(valid).value` | + ## Writing Failing Tests (Step 3 Checklist) 1. For each UUID in the feature doc, create one test function @@ -133,9 +165,42 @@ def test_compute_distance_always_non_negative(x: float) -> None: 4. Run `pytest` — confirm every new test fails with `ImportError` or `AttributeError`, not a logic failure 5. Commit: `test(): add failing tests for all acceptance criteria` +## Integration Test Requirement + +For any feature with multiple components or user interaction, at least one `@pytest.mark.integration` test must exercise the public entry point with realistic input. This test must NOT call internal helpers directly — it must go through the same path a real user would. + +## Semantic Alignment Rule + +The test's Given/When/Then must operate at the **same abstraction level** as the AC's Given/When/Then. 
+ +| AC says | Test must do | +|---|---| +| "When the user presses W" | Send `"W"` through the actual input mechanism (stdin, key event, CLI arg) | +| "When `update_player` receives 'W'" | Call `update_player("W")` directly — the boundary is explicit | + +If testing through the real entry point is infeasible, the developer must add a new AC via `skill extend-criteria` that explicitly describes the lower-level boundary. **Never silently shift abstraction levels.** + +## UUID Uniqueness + +One test function per UUID. One UUID per test function. If only `Given` varies across cases, that is a property by definition — use Hypothesis `@given` with `@example` for known edge cases. If `When` or `Then` would differ, that is a new criterion: use `extend-criteria`. + +## Property-Based Testing Decision Rule + +Use Hypothesis when a function has invariants — things that must always be true regardless of input: + +| Code has... | Write a property test for... | +|---|---| +| Ball position clamped to screen | Position always stays within bounds | +| Score accumulator | Score never goes negative | +| Sorted collection | Order is always preserved after insertion | +| Domain value object | Constructor always rejects invalid inputs | + +This catches "technically passes but doesn't work" failures that deterministic tests miss. + ## Quality Rules -- Assert behavior, not structure: no `isinstance()`, `type()`, or internal attribute checks +- Write every test as if you cannot see the production code. The test describes what a caller observes, not how the code achieves it. If a complete internal rewrite would not break this test (while preserving behavior), it is correctly written. 
+- No `isinstance()`, `type()`, or internal attribute (`_x`) checks in assertions - One assertion concept per test (multiple `assert` statements are ok if they verify the same thing) - No `pytest.skip` or `pytest.mark.xfail` without written justification in the docstring - Never use `noqa` — fix the underlying issue instead diff --git a/.opencode/skills/verify/SKILL.md b/.opencode/skills/verify/SKILL.md index ed36319..3b055d2 100644 --- a/.opencode/skills/verify/SKILL.md +++ b/.opencode/skills/verify/SKILL.md @@ -11,6 +11,8 @@ workflow: feature-lifecycle This skill guides the reviewer through Step 5: independent verification that the feature works correctly and meets quality standards. The output is a written report with a clear APPROVED or REJECTED decision. +**Your default hypothesis is that the code is broken despite passing automated checks. Your job is to find the failure mode. If you cannot find one after thorough investigation, the verdict is APPROVED. If you find one, it is REJECTED.** + ## When to Use After the developer signals Step 4 is complete. Do not start verification until the developer has committed all work. @@ -21,11 +23,14 @@ After the developer signals Step 4 is complete. Do not start verification until Read `docs/features/in-progress/.md`. Extract: - All UUIDs and their descriptions +- The interaction model from Notes (if the feature involves user interaction) +- The developer's pre-mortem (if present in the Architecture section) ### 2. Check Commit History ```bash git log --oneline -20 +git status ``` Verify: @@ -33,101 +38,130 @@ Verify: - Every step has a commit (`bootstrap`, `failing tests`, per-feature-name commits) - No uncommitted changes: `git status` should be clean -### 3. Run Verification Commands (in order) - -Run each command. Record the exact exit code and output summary. +### 3. Production-Grade Gate + +Run before code review. If any row is FAIL → REJECTED immediately. 
+ +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| Developer declared production-grade | Read feature doc pre-mortem or handoff message | Explicit statement present | Absent or says "demo" or "incomplete" | Developer must complete the implementation | +| App exits cleanly | `timeout 10s uv run task run` | Exit 0 or non-124 | Exit 124 (timeout/hang) | Developer must fix the hang | +| Output changes when input changes | Run app, change an input or condition, observe output | Output changes accordingly | Output is static regardless of input | Developer must implement real logic — output that does not change with input is not complete | + +### 4. Code Review + +Read the source files changed in this feature. **Do this before running lint/static-check/test** — if code review finds a design problem, commands will need to re-run after the fix anyway. + +**Correctness** — any FAIL → REJECTED: + +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| No dead code | Read for unreachable statements, unused variables, impossible branches | None found | Any found | Remove or fix the unreachable path | +| No duplicate logic (DRY) | Search for repeated blocks doing the same thing | None found | Duplication found | Extract to shared function | +| No over-engineering (YAGNI) | Check for abstractions with no current use | None found | Unused abstraction or premature generalization | Remove unused code | + +**Simplicity (KISS)** — any FAIL → REJECTED: + +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| Functions do one thing | Read each function; can you describe it without `and`? 
| Yes | No | Split into focused functions | +| Nesting ≤ 2 levels | Count indent levels in each function | ≤ 2 | > 2 | Extract inner block to helper | +| Functions ≤ 20 lines | Count lines | ≤ 20 | > 20 | Extract helper | +| Classes ≤ 50 lines | Count lines | ≤ 50 | > 50 | Split class | + +**SOLID** — any FAIL → REJECTED: + +| Principle | Why it matters | What to check | How to check | PASS/FAIL | Evidence (`file:line`) | +|---|---|---|---|---|---| +| SRP | Multiple change-reasons accumulate bugs at every change site | Each class/function has one reason to change | Count distinct concerns; each `and` in its description = warning sign | | | +| OCP | Modifying existing code for new behavior invalidates existing tests | New behavior via extension, not modification | Check if adding the new case required editing existing class bodies | | | +| LSP | Substitution failures cause silent runtime errors tests miss | Subtypes behave identically to base type at all call sites | Check if any subtype narrows a contract or raises where base does not | | | +| ISP | Fat interfaces force implementors to have methods they cannot meaningfully implement | No Protocol/ABC forces unused method implementations | Check if any implementor raises `NotImplementedError` or passes on inherited methods | | | +| DIP | Depending on concrete I/O makes unit testing impossible | High-level modules depend on abstractions (Protocols) | Check if any domain class imports from I/O, DB, or framework layers directly | | | + +**Object Calisthenics** — any FAIL → REJECTED: + +| # | Rule | Why it matters | How to check | PASS/FAIL | Evidence (`file:line`) | +|---|---|---|---|---|---| +| 1 | One indent level per method | Reduces cognitive load per function | Count max nesting in source | | | +| 2 | No `else` after `return` | Eliminates hidden control flow paths | Search for `else` inside functions with early returns | | | +| 3 | Primitives wrapped | Prevents primitive obsession; enables validation at 
construction | Bare `int`/`str` in domain signatures = FAIL | | | +| 4 | Collections wrapped | Encapsulates iteration and filtering logic | `list[X]` as domain value = FAIL | | | +| 5 | One dot per line | Reduces coupling to transitive dependencies | `a.b.c()` = FAIL | | | +| 6 | No abbreviations | Names are documentation; abbreviations lose meaning | `mgr`, `tmp`, `calc` = FAIL | | | +| 7 | Small entities | Smaller units are easier to test, read, and replace | Functions > 20 lines or classes > 50 lines = FAIL | | | +| 8 | ≤ 2 instance variables | Forces single responsibility through structural constraint | Count `self.x` assignments in `__init__` | | | +| 9 | No getters/setters | Enforces tell-don't-ask; behavior lives with data | `get_x()`/`set_x()` pairs = FAIL | | | + +**Design Patterns** — any FAIL → REJECTED: + +| Code smell | Pattern missed | Why it matters | How to check | PASS/FAIL | Evidence (`file:line`) | +|---|---|---|---|---|---| +| Multiple if/elif on type/state | State or Strategy | Eliminates conditional complexity, makes adding new states safe | Search for chains of `isinstance` or string-based dispatch | | | +| Complex `__init__` with side effects | Factory or Builder | Separates construction from use, enables testing | Check `__init__` line count and side effects | | | +| Callers must know multiple internal components | Facade | Single entry point reduces coupling | Check how callers interact with the subsystem | | | +| External dep without Protocol | Repository/Adapter | Enables testing without real I/O; enforces DIP | Check if the dep is injected via abstraction | | | +| 0 domain classes, many functions | Missing domain model | Procedural code has no encapsulation boundary | Count classes vs functions in domain code | | | + +**Tests** — any FAIL → REJECTED: + +| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| UUID docstring format | Read first line of each docstring | UUID only, blank line, Given/When/Then | Description on 
UUID line | Remove description; UUID line must be bare | +| Contract test | Would this test survive a full internal rewrite? | Yes | No | Rewrite assertion to test observable output, not internals | +| No internal attribute access | Search for `_x` in assertions | None found | `_x`, `isinstance`, `type()` found | Replace with public API assertion | +| Every AC has a mapped test | `grep -r "<uuid>" tests/` per UUID | Found | Not found | Write the missing test | +| No UUID used twice | See command below — empty = PASS | Empty output | UUID printed | If only `Given` differs: consolidate into Hypothesis `@given` + `@example`. If `When`/`Then` differs: use `extend-criteria` | ```bash -uv run task lint +# UUID Drift check — any output = FAIL +grep -rh --include='*.py' '[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}' tests/ \ + | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \ + | sort | uniq -d ``` -Expected: exit 0, no issues. If ruff makes auto-fixes, that is a FAIL (developer should have run lint before handing off). -```bash -uv run task static-check -``` -Expected: exit 0, `0 errors, 0 warnings` from pyright. +**Versions and Build** — any FAIL → REJECTED: -```bash -uv run task test -``` -Expected: exit 0, all tests pass, coverage ≥ 100%. 
+| Check | How to check | PASS | FAIL | Fix | +|---|---|---|---|---| +| `pyproject.toml` version matches `__version__` | Read both files | Match | Mismatch | Align the version strings | +| Coverage target matches package | Check `--cov=` in test config | Matches actual package | Wrong package name | Fix the `--cov` argument | +| All declared packages exist | Check `[tool.setuptools] packages` against filesystem | All present | Missing package | Add the missing directory or remove the declaration | +| No `noqa` comments | `grep -r "noqa" src/` | None found | Any found | Fix the underlying issue | +| No `type: ignore` comments | `grep -r "type: ignore" src/` | None found | Any found | Fix the underlying type error | + +### 5. Run Verification Commands (in order, stop on first failure) ```bash -timeout 10s uv run task run +uv run task lint +uv run task static-check +uv run task test ``` -Expected: exit 0 (app completes) or any non-124 exit. **Exit code 124 means the process was killed by timeout — the app hung or is an infinite loop. This is a FAIL.** For interactive/long-running apps, check that startup completes without error before the timeout. - -**If any command fails, stop here.** Record the failure and issue a REJECTED report. Do not continue checking. - -### 4. UUID Traceability Check - -For each acceptance criterion UUID in the feature doc: -- Find the corresponding test function using `grep -r "" tests/` -- Verify the test function name follows `test_` -- Verify the test docstring contains only the UUID on the first line (no description) - -Flag any UUID with no corresponding test as UNCOVERED. - -If you identify a missing behavior that has no acceptance criterion, load `skill extend-criteria` to determine whether it is a gap within scope (add criterion with `Source: reviewer`) or a new feature to escalate to the PO. -### 5. Code Review +Expected for each: exit 0, no errors. Record exact output on failure. -Read the source files changed in this feature. 
Check: +### 6. Interactive Verification -**Correctness** -- No dead code (unreachable statements, unused variables, impossible branches) -- No duplicate logic -- No over-engineering (unused abstractions, premature generalization) +If the feature involves user interaction: run the app, provide real input, verify the output changes in response. An app that produces the same output regardless of input is NOT verified. -**Simplicity (KISS)** -- Functions do one thing -- Nesting no deeper than 2 levels -- Functions ≤ 20 lines -- Classes ≤ 50 lines - -**SOLID** -- Single responsibility: each class/function has one reason to change -- Open/closed: new behavior via extension, not modification -- Liskov: subtypes are usable as their base types -- Interface segregation: no interface forces implementing unused methods -- Dependency inversion: high-level modules depend on abstractions - -**Object Calisthenics** (check all 9) -1. One level of indentation per method -2. No `else` after `return` -3. Primitives wrapped (domain concepts use value objects, not raw strings/ints) -4. Collections wrapped in domain classes -5. One dot per line (no chaining: `a.b.c()`) -6. No abbreviations (`mgr`, `tmp`, `calc` are violations) -7. Small entities (see size limits above) -8. ≤ 2 instance variables per class -9. No getters/setters; use commands and queries - -**Tests** -- Every test has UUID docstring: `` only on the first line, blank line, then Given/When/Then -- Tests assert behavior, not structure (no `isinstance`, no `type()`, no internal attribute access) -- `# Given`, `# When`, `# Then` comments in test body -- No `pytest.skip`, no `pytest.mark.xfail` without explicit justification - -**Build Consistency** -- `pyproject.toml` version matches `/__version__` -- `--cov=` in test config matches actual package name -- All packages listed in `[tool.setuptools] packages` exist in the codebase -- No `noqa` comments -- No `type: ignore` comments - -### 6. Write the Report +### 7. 
Write the Report ```markdown ## Step 5 Verification Report — +### Production-Grade Gate +| Check | Result | Notes | +|---|---|---| +| Developer declared production-grade | PASS / FAIL | | +| App exits cleanly | PASS / FAIL / TIMEOUT | | +| Output driven by real logic | PASS / FAIL | | + ### Commands | Command | Result | Notes | |---------|--------|-------| | uv run task lint | PASS / FAIL |
| | uv run task static-check | PASS / FAIL | | | uv run task test | PASS / FAIL | | -| timeout 10s uv run task run | PASS / FAIL / TIMEOUT | | ### UUID Traceability | UUID | Description | Test | Status | @@ -159,3 +193,8 @@ OR | Uncovered UUIDs | 0 | | `noqa` comments | 0 | | `type: ignore` | 0 | +| Semantic alignment mismatches | 0 | +| SOLID FAIL rows | 0 | +| ObjCal FAIL rows | 0 | +| Design pattern FAIL rows | 0 | +| Duplicate UUIDs | 0 | diff --git a/AGENTS.md b/AGENTS.md index 3f4ba53..b01317c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -20,6 +20,8 @@ STEP 6: ACCEPT (product-owner) → demo, validate, merge, tag **PO picks the next feature from backlog. Developer never self-selects.** +**Verification is adversarial.** The reviewer's job is to try to break the feature, not to confirm it works. The default hypothesis is "it might be broken despite green checks; prove otherwise." + ## Agents - **product-owner** — defines scope, acceptance criteria, picks features, accepts deliveries @@ -124,6 +126,14 @@ Rules: - **Class length**: ≤ 50 lines - **Max nesting**: 2 levels - **Instance variables**: ≤ 2 per class +- **Semantic alignment**: tests must operate at the same abstraction level as the acceptance criteria they cover. If the AC says "when the user presses W," the test must send W through the actual input mechanism, not call an internal helper. +- **Integration tests**: multi-component features and features involving user interaction require at least one `@pytest.mark.integration` test that exercises the public entry point. + +## Verification Philosophy + +- **Automated checks** (lint, typecheck, coverage) verify **syntax-level** correctness — the code is well-formed. +- **Human review** (semantic alignment, code review, manual testing) verifies **semantic-level** correctness — the code does what the user needs. +- Both are required. All-green automated checks are necessary but not sufficient for APPROVED. 
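A minimal sketch of the semantic-alignment and integration-test rules above: the test drives the public entry point at the same abstraction level as the AC ("when the user presses W"), instead of calling an internal movement helper. All names here (`handle_key`, the game state, the UUID) are invented for illustration and are not part of this template:

```python
# Hypothetical illustration of a semantically aligned test. The AC reads
# "when the user presses W, the player moves forward", so the test sends
# "w" through the public input mechanism rather than poking internals.
# (In the real suite this would also carry @pytest.mark.integration.)

def handle_key(state: dict, key: str) -> dict:
    """Public entry point: map one keypress to a new game state."""
    if key == "w":
        return {**state, "forward": state["forward"] + 1}
    return state

def test_pressing_w_moves_player_forward() -> None:
    """11111111-2222-3333-4444-555555555555

    Given a player at the starting position
    When the user presses W
    Then the player's forward position increases
    """
    # Given
    state = {"forward": 0}
    # When
    result = handle_key(state, "w")
    # Then
    assert result["forward"] > state["forward"]

test_pressing_w_moves_player_forward()
```

The assertion observes only what a caller can observe, so the test survives a full internal rewrite of the movement logic.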
## Feature Document Format @@ -197,6 +207,10 @@ Source: docs/features/in-progress/.md ``` +## Research Foundations + +The cognitive science and social science mechanisms behind this workflow's design are documented in `docs/academic_research.md` — pre-mortem, implementation intentions, adversarial collaboration, elaborative encoding, and 11 other mechanisms with full citations. + ## Setup To initialize a new project from this template: diff --git a/docs/academic_research.md b/docs/academic_research.md new file mode 100644 index 0000000..1de562d --- /dev/null +++ b/docs/academic_research.md @@ -0,0 +1,192 @@ +# Academic Research — Theoretical Foundations + +This document explains the cognitive and social-science mechanisms that justify the workflow reforms in this template. Each mechanism is grounded in peer-reviewed research. + +--- + +## Mechanisms + +### 1. Pre-mortem (Prospective Hindsight) + +| | | +|---|---| +| **Source** | Klein, G. (1998). *Sources of Power: How People Make Decisions*. MIT Press. | +| **Core finding** | Asking "imagine this failed — why?" catches 30% more issues than forward-looking review. | +| **Mechanism** | Prospective hindsight shifts from prediction (weak) to explanation (strong). The brain is better at explaining past events than predicting future ones. By framing as "it already failed," you activate explanation mode. | +| **Where used** | PO pre-mortem at scope, developer pre-mortem before handoff. | + +--- + +### 2. Implementation Intentions + +| | | +|---|---| +| **Source** | Gollwitzer, P. M. (1999). Implementation intentions: Strong effects of simple plans. *American Psychologist*, 54(7), 493–503. | +| **Core finding** | "If X then Y" plans are 2–3x more likely to execute than general intentions. | +| **Mechanism** | If-then plans create automatic cue-response links in memory. The brain processes "if function > 20 lines then extract helper" as an action trigger, not a suggestion to consider. 
| +| **Where used** | Refactor Self-Check Gates in `implementation/SKILL.md`, Structural Quality Checks in `code-quality/SKILL.md`. | + +--- + +### 3. Commitment Devices + +| | | +|---|---| +| **Source** | Cialdini, R. B. (2001). *Influence: The Psychology of Persuasion* (rev. ed.). HarperBusiness. | +| **Core finding** | Forcing an explicit micro-commitment (filling in a PASS/FAIL cell) creates resistance to reversals. A checkbox checked is harder to uncheck than a todo noted. | +| **Mechanism** | Structured tables with PASS/FAIL cells create commitment-device effects. The act of marking "FAIL" requires justification, making silent passes psychologically costly. | +| **Where used** | SOLID enforcement table, ObjCal enforcement table, Design Patterns table — all require explicit PASS/FAIL with evidence. | + +--- + +### 4. System 2 Before System 1 + +| | | +|---|---| +| **Source** | Kahneman, D. (2011). *Thinking, Fast and Slow*. Farrar, Straus and Giroux. | +| **Core finding** | System 1 (fast, automatic) is vulnerable to anchoring and confirmation bias. System 2 (slow, deliberate) must be activated before System 1's judgments anchor. | +| **Mechanism** | Running semantic review *before* automated commands prevents the "all green" dopamine hit from anchoring the reviewer's judgment. Doing hard cognitive work first protects against System 1 shortcuts. | +| **Where used** | Verification order in `verify/SKILL.md`: semantic alignment check before commands. | + +--- + +### 5. Adversarial Collaboration + +| | | +|---|---| +| **Source** | Mellers, B. A., Hertwig, R., & Kahneman, D. (2001). Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration. *Psychological Science*, 12(4), 269–275. | +| **Core finding** | Highest-quality thinking emerges when parties hold different hypotheses and are charged with finding flaws in each other's reasoning. 
| +| **Mechanism** | Explicitly framing the reviewer as "your job is to break this feature" activates the adversarial collaboration mode. The reviewer seeks disconfirmation rather than confirmation. | +| **Where used** | Adversarial mandate in `reviewer.md` and `verify/SKILL.md`. | + +--- + +### 6. Accountability to Unknown Audience + +| | | +|---|---| +| **Source** | Tetlock, P. E. (1983). Accountability and the perseverance of first impressions. *Social Psychology Quarterly*, 46(4), 285–292. | +| **Core finding** | Accountability to an unknown audience with unknown views improves reasoning quality. The agent anticipates being audited and adjusts reasoning. | +| **Mechanism** | The explicit report format (APPROVED/REJECTED with evidence) creates an accountability structure — the reviewer's reasoning will be read by the PO. | +| **Where used** | Report format in `verify/SKILL.md`, structured evidence columns in all enforcement tables. | + +--- + +### 7. Chunking and Cognitive Load Reduction + +| | | +|---|---| +| **Source** | Miller, G. A. (1956). The magical number seven, plus or minus two. *Psychological Review*, 63(2), 81–97. | +| **Alternative** | Sweller, J. (1988). Cognitive load during problem solving. *Cognitive Science*, 12(2), 257–285. | +| **Core finding** | Structured tables reduce working memory load vs. narrative text. Chunking related items into table rows enables parallel processing. | +| **Mechanism** | Replacing prose checklists ("Apply SOLID principles") with structured tables (5 rows, 4 columns) allows the reviewer to process all items in a single pass. | +| **Where used** | All enforcement tables in `verify/SKILL.md` and `reviewer.md`. | + +--- + +### 8. Elaborative Encoding + +| | | +|---|---| +| **Source** | Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. *Journal of Verbal Learning and Verbal Behavior*, 11(6), 671–684. 
| +| **Core finding** | Deeper processing — explaining *why* a rule matters — leads to better retention and application than shallow processing (just listing rules). | +| **Mechanism** | Adding a "Why it matters" column to enforcement tables forces the reviewer to process the rationale, not just scan the rule name. | +| **Where used** | SOLID table, ObjCal table, Design Patterns table — all have "Why it matters" column. | + +--- + +### 9. Error-Specific Feedback + +| | | +|---|---| +| **Source** | Hattie, J., & Timperley, H. (2007). The power of feedback. *Review of Educational Research*, 77(1), 81–112. | +| **Core finding** | Feedback is most effective when it tells the agent exactly what went wrong and what the correct action is. "FAIL: function > 20 lines at file:47" is actionable. "Apply function length rules" is not. | +| **Mechanism** | The evidence column in enforcement tables requires specific file:line references, turning vague rules into actionable directives. | +| **Where used** | Evidence column in all enforcement tables. | + +--- + +### 10. Prospective Memory Cues + +| | | +|---|---| +| **Source** | McDaniel, M. A., & Einstein, G. O. (2000). Strategic and automatic processes in prospective memory retrieval. *Applied Cognitive Psychology*, 14(7), S127–S144. | +| **Core finding** | Memory for intended actions is better when cues are embedded at the point of action, not in a separate appendix. | +| **Mechanism** | Placing if-then gates inline (in the REFACTOR section) rather than in a separate "reference" document increases adherence. The cue appears exactly when the developer is about to make the relevant decision. | +| **Where used** | Refactor Self-Check Gates embedded inline in `implementation/SKILL.md`. | + +--- + +### 11. Observable Behavior Testing + +| | | +|---|---| +| **Source** | Vocke, H. (2018). *The Practical Test Pyramid*. Thoughtworks. 
https://martinfowler.com/articles/practical-test-pyramid.html | +| **Core finding** | Tests should answer "if I enter X and Y, will the result be Z?" — not "will method A call class B first?" | +| **Mechanism** | A test is behavioral if its assertion describes something a caller/user can observe without knowing the implementation. The test should still pass if you completely rewrite the internals. | +| **Where used** | Contract test rule in `tdd/SKILL.md`: "Write every test as if you cannot see the production code." | + +--- + +### 12. Test-Behavior Alignment + +| | | +|---|---| +| **Source** | Google Testing Blog (2013). *Testing on the Toilet: Test Behavior, Not Implementation*. | +| **Core finding** | Test setup may need to change if implementation changes, but the actual test shouldn't need to change if the code's user-facing behavior doesn't change. | +| **Mechanism** | Tests that are tightly coupled to implementation break on refactoring and become a drag on design improvement. Behavioral tests survive internal rewrites. | +| **Where used** | Contract test rule + bad example in `tdd/SKILL.md`, reviewer verification check in `reviewer.md`. | + +--- + +### 13. Tests as First-Class Citizens + +| | | +|---|---| +| **Source** | Martin, R. C. (2017). *First-Class Tests*. Clean Coder Blog. | +| **Core finding** | Tests should be treated as first-class citizens of the system — not coupled to implementation. Bad tests are worse than no tests because they give false confidence. | +| **Mechanism** | Tests written as "contract tests" — describing what the caller observes — remain stable through refactoring. Tests that verify implementation details are fragile and create maintenance burden. | +| **Where used** | Contract test rule in `tdd/SKILL.md`, verification check in `reviewer.md`. | + +--- + +### 14. Property-Based Testing (Invariant Discovery) + +| | | +|---|---| +| **Source** | MacIver, D. R. (2016). *What is Property Based Testing?* Hypothesis. 
https://hypothesis.works/articles/what-is-property-based-testing/ | +| **Core finding** | Property-based testing is "the construction of tests such that, when these tests are fuzzed, failures reveal problems that could not have been revealed by direct fuzzing." Property tests test *invariants* — things that must always be true about the contract, not things that fall out of how you wrote it. | +| **Mechanism** | Meaningful property tests assert invariants: "assert Score(x).value >= 0" tests the contract. Tautological tests assert reconstruction: "assert Score(x).value == x" tests the implementation. | +| **Where used** | Meaningful vs. Tautological table in `tdd/SKILL.md`, Property-Based Testing Decision Rule table in `tdd/SKILL.md`. | + +--- + +### 15. Mutation Testing (Test Quality Verification) + +| | | +|---|---| +| **Source** | King, K. N., & Offutt, A. J. (1991). A Fortran language system for mutation-based software testing. *Software: Practice and Experience*, 21(7), 685–718. | +| **Alternative** | Mutation testing tools: Cosmic Ray, mutmut (Python) | +| **Core finding** | A meaningful test fails when a mutation (small deliberate code change) is introduced. A tautological test passes even with mutations because it doesn't constrain the behavior. | +| **Mechanism** | If a test survives every mutation of the production code without failing, it tests nothing. Only tests that fail on purposeful "damage" to the code are worth keeping. | +| **Where used** | Note in `tdd/SKILL.md` Quality Rules (implicitly encouraged: tests must describe contracts, not implementation, which is the theoretical complement to mutation testing). | + +--- + +## Bibliography + +1. Cialdini, R. B. (2001). *Influence: The Psychology of Persuasion* (rev. ed.). HarperBusiness. +2. Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. *Journal of Verbal Learning and Verbal Behavior*, 11(6), 671–684. +3. Gollwitzer, P. M. (1999). Implementation intentions: Strong effects of simple plans. 
*American Psychologist*, 54(7), 493–503. +4. Hattie, J., & Timperley, H. (2007). The power of feedback. *Review of Educational Research*, 77(1), 81–112. +5. Kahneman, D. (2011). *Thinking, Fast and Slow*. Farrar, Straus and Giroux. +6. Klein, G. (1998). *Sources of Power: How People Make Decisions*. MIT Press. +7. McDaniel, M. A., & Einstein, G. O. (2000). Strategic and automatic processes in prospective memory retrieval. *Applied Cognitive Psychology*, 14(7), S127–S144. +8. Mellers, B. A., Hertwig, R., & Kahneman, D. (2001). Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration. *Psychological Science*, 12(4), 269–275. +9. Miller, G. A. (1956). The magical number seven, plus or minus two. *Psychological Review*, 63(2), 81–97. +10. Sweller, J. (1988). Cognitive load during problem solving. *Cognitive Science*, 12(2), 257–285. +11. Tetlock, P. E. (1983). Accountability and the perseverance of first impressions. *Social Psychology Quarterly*, 46(4), 285–292. +12. Vocke, H. (2018). The Practical Test Pyramid. *Thoughtworks*. https://martinfowler.com/articles/practical-test-pyramid.html +13. Google Testing Blog. (2013). Testing on the Toilet: Test Behavior, Not Implementation. +14. Martin, R. C. (2017). First-Class Tests. *Clean Coder Blog*. +15. MacIver, D. R. (2016). What is Property Based Testing? *Hypothesis*. https://hypothesis.works/articles/what-is-property-based-testing/ +16. King, K. N., & Offutt, A. J. (1991). A Fortran language system for mutation-based software testing. *Software: Practice and Experience*, 21(7), 685–718.