diff --git a/README.md b/README.md index bda1f5e..ee982cd 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,9 @@ # Workspine -AI agents forget when the session ends. Workspine writes plans, decisions, and verification to `.planning/` so any agent or runtime can pick up where the last one stopped. +A repo-native delivery spine for the part of AI coding that still needs human judgment: planning, checking, execution, verification, and handoff. + +Workspine keeps plans, decisions, proof, and handoff state in the repo so another session, agent, or runtime can continue from repo truth instead of chat memory. [![npm version](https://img.shields.io/npm/v/gsdd-cli?style=for-the-badge&logo=npm&logoColor=white&color=CB3837)](https://www.npmjs.com/package/gsdd-cli) [![License](https://img.shields.io/badge/license-MIT-blue?style=for-the-badge)](LICENSE) @@ -11,98 +13,150 @@ AI agents forget when the session ends. Workspine writes plans, decisions, and v npx -y gsdd-cli init ``` -**Validated:** Claude Code, Codex CLI, OpenCode. **Qualified:** Cursor, Copilot, Gemini. +Directly validated today: Claude Code, Codex CLI, OpenCode. +Qualified support: Cursor, Copilot, Gemini can use the shared skills surface when their skill or slash discovery sees it; proof and ergonomics differ from the directly validated runtimes. --- -## How it works +## What This Is -`init` places workflow skills in `.agents/skills/` and optionally native adapters for your runtime. Then you run workflows through your agent — each one writes files to the repo: +Workspine is a small set of workflow sources plus the `gsdd` CLI. 
It creates: -| Workflow | Writes | What for | -|----------|--------|----------| -| `gsdd-new-project` | `.planning/SPEC.md`, `ROADMAP.md` | Define the project and phases | -| `gsdd-plan` | `.planning/phases/N/PLAN.md` | Research and review before any code gets written | -| `gsdd-execute` | `.planning/phases/N/SUMMARY.md` | Implement the approved plan, nothing more | -| `gsdd-verify` | `.planning/phases/N/VERIFICATION.md` | Confirm the plan's claims are actually true | +- `.planning/` for specs, roadmaps, phase plans, summaries, verification reports, and handoff state. +- `.agents/skills/gsdd-*/SKILL.md` as the portable workflow entry surface. +- `.planning/bin/gsdd.mjs` as the repo-local helper runtime for deterministic commands from the repo root. +- Optional runtime adapters for tools that benefit from native surfaces. -The discipline: plan first, execute only what's approved, verify before closing. Each phase summary carries forward what was decided, so the next session starts with context instead of from scratch. +Workspine ships 14 workflows. The product name is Workspine, while the package, CLI commands, workflow prefixes, and workspace directory remain `gsdd-cli`, `gsdd`, `gsdd-*`, and `.planning/` - these are retained technical contracts, not rename residue. -Workspine ships 14 workflows. The package and CLI are `gsdd-cli` / `gsdd-*` — retained as the technical contract under the Workspine product name. +Workspine began as a fork of Get Shit Done, whose long-horizon delivery spine proved the problem was real. Workspine keeps that useful discipline while narrowing the public surface around repo-native planning, execution, verification, and handoff. ---- +Launch proof posture: + +- Directly validated in repo truth: Claude Code, Codex CLI, OpenCode. +- Qualified support only: Cursor, Copilot, Gemini can use `.agents/skills/` plus optional governance when skill or slash discovery is available. 
+- Codex CLI is separate from Codex VS Code and the Codex app; use native discovery there when available, otherwise open or paste `.agents/skills/gsdd-*/SKILL.md`. +- Generated runtime surfaces are checked by `gsdd health` against current render output and repaired deterministically with `npx -y gsdd-cli update`. +- Public proof entrypoints: [Brownfield Proof](docs/BROWNFIELD-PROOF.md), [consumer proof pack](docs/proof/consumer-node-cli/README.md), [Runtime Support](docs/RUNTIME-SUPPORT.md), and [Verification Discipline](docs/VERIFICATION-DISCIPLINE.md). + +## Getting Started -## Get started +### Quickstart + +Run the guided install wizard in a project root: ```bash -npx -y gsdd-cli init # guided wizard -npx -y gsdd-cli init --tools claude # Claude Code only -npx -y gsdd-cli init --tools opencode # OpenCode only -npx -y gsdd-cli init --tools codex # Codex CLI only -npx -y gsdd-cli init --tools all # all runtimes -npx -y gsdd-cli init --auto --tools all # headless / CI +npx -y gsdd-cli init ``` -### Which workflow to start with +Then invoke workflows through your agent: -| Situation | Start here | -|-----------|------------| -| New project, or brownfield work that's broad / milestone-shaped | `gsdd-new-project` — full initializer, runs codebase mapping internally when needed | -| Existing repo, and the change you want to make is already concrete | `gsdd-quick` — bounded-change lane, lighter ceremony | -| Existing repo is unfamiliar or risky and you want a baseline first | `gsdd-map-codebase` — orientation pass before choosing the above | +- Claude Code / OpenCode: use slash commands such as `/gsdd-plan`. +- Codex CLI: use skill references such as `$gsdd-plan`. +- Cursor / Copilot / Gemini: Use slash commands if your tool discovers them; if it does not, open `.agents/skills/gsdd-/SKILL.md`. +- Any other agent: open the matching `SKILL.md` file directly. 
+ +Headless setup is available for scripts and prepared briefs: -### Invoke through your agent +```bash +npx -y gsdd-cli init --auto --tools codex --brief brief.md +``` + +### Invoke Through Your Agent | Runtime | How | |---------|-----| | Claude Code / OpenCode | `/gsdd-plan` slash command | -| Codex CLI | `$gsdd-plan` skill reference | -| Codex VS Code / app | Native discovery if available | -| Cursor / Copilot / Gemini | Slash command if discovered | +| Codex CLI | `$gsdd-plan` skill reference; Codex uses the portable `gsdd-plan` entry and can add a native checker agent at `.codex/agents/gsdd-plan-checker.toml` | +| Codex VS Code / app | Native discovery if available; otherwise open or paste the generated `SKILL.md` | +| Cursor / Copilot / Gemini | `/gsdd-plan` slash command when skill/slash discovery is available; if it is not, open `.agents/skills/gsdd-/SKILL.md` | | Any other agent | Open `.agents/skills/gsdd-plan/SKILL.md` | -### Team use +### Which Workflow To Start With + +| Situation | Start here | +|-----------|------------| +| New project, broad brownfield work, or milestone-shaped work | `gsdd-new-project` | +| Existing repo with a concrete bounded change | `gsdd-quick` | +| Unfamiliar or risky repo where you want orientation first | `gsdd-map-codebase` | -Commit `.planning/` so the team shares specs, roadmaps, phase plans, and verification reports. Each developer runs `init --tools ` for their own runtime adapters without changing the shared delivery artifacts. +### Team Use ---- +Commit shared planning artifacts when `commitDocs` is enabled for the team. Developers can regenerate their own runtime adapters with `npx -y gsdd-cli init --tools ` without changing the shared delivery state. -## Where it fits +### What to Track in Git -Use Workspine when a feature takes more than one session, or when you need to switch between Claude, Codex, and Cursor without losing the thread. 
Skip it for quick, obvious edits — direct prompting is cheaper when the risk is small. +Track `.planning/SPEC.md`, `.planning/ROADMAP.md`, phase plans, summaries, verification reports, and public proof docs. Treat `.planning/.local/`, local browser captures, unsafe screenshots, and machine-specific runtime artifacts as local-only unless a plan explicitly narrows and approves publication. -| Tool | Good for | vs Workspine | -|------|----------|--------------| -| **Workspine** | Work that spans sessions, agents, or runtimes where plans and proof need to stay in the repo | — | -| [GSD](https://github.com/gsd-build/get-shit-done) | Broad AI prompting suite — 81 commands, 78 workflows, 33 agents | Workspine is narrower: 14 workflows, fewer moving parts for the human in the loop | -| [OpenSpec](https://openspec.dev/) | Living spec + change proposals in a lightweight format | Workspine adds the execution, verification, and handoff layer on top of planning | -| [LeanSpec](https://www.lean-spec.dev/docs/guide/first-principles) | Minimal specs that fit LLM context | Workspine adds workflow gates and runtime entrypoints for when you need the full structure | -| [GitHub Spec Kit](https://github.com/github/spec-kit) | Spec-first planning workflows in `.specify/` | Similar space; Workspine is one CLI with one delivery loop instead of a broader ecosystem | -| [Kiro](https://kiro.dev/docs/) | IDE-native agent dev with specs, steering, hooks, and MCP | Kiro is IDE-only; Workspine works across terminal and IDE agents that can read repo files | -| [Tessl](https://tessl.io/enterprise/) | Hosted platform for distributing agent skills across teams | Tessl needs a control plane; Workspine is local-first with no hosted infrastructure | +## Workflow -Based on each tool's public docs as of May 2026. Open an issue if anything reads inaccurately. 
+```text +npx -y gsdd-cli init -> bootstrap .planning/, skills, and optional adapters +/gsdd-new-project -> create SPEC.md and ROADMAP.md +/gsdd-plan N -> create a reviewed phase plan +/gsdd-execute N -> implement the approved plan +/gsdd-verify N -> verify before closing the phase +/gsdd-audit-milestone -> check cross-phase integration +/gsdd-complete-milestone -> archive and evolve the roadmap +/gsdd-new-milestone -> begin the next milestone +/gsdd-quick -> bounded task outside the phase cycle +/gsdd-pause -> write a checkpoint +/gsdd-resume -> restore context and route next action +/gsdd-progress -> report status without mutating files +``` ---- +## Configuration -## CLI +Use model profiles to trade cost against review depth: ```bash -npx -y gsdd-cli health # workspace integrity check -npx -y gsdd-cli update # regenerate stale runtime surfaces -npx -y gsdd-cli models profile quality # maximize review rigor -npx -y gsdd-cli models profile budget # minimize cost -npx -y gsdd-cli control-map # repo and planning state at a glance +npx -y gsdd-cli models profile quality # maximize review rigor +npx -y gsdd-cli models profile balanced # default balance +npx -y gsdd-cli models profile budget # minimize cost +npx -y gsdd-cli rigor thorough # raise planning/review rigor ``` -Full reference: [User Guide](docs/USER-GUIDE.md) · [Runtime Support](docs/RUNTIME-SUPPORT.md) · [Verification Discipline](docs/VERIFICATION-DISCIPLINE.md) +`npx -y gsdd-cli health` checks generated runtime surfaces against current render output. If surfaces drift, repair them with `npx -y gsdd-cli update` or regenerate only templates with `npx -y gsdd-cli update --templates`. 
---- +## CLI Commands + +| Command | Purpose | +|---------|---------| +| `npx -y gsdd-cli init` | Guided install wizard and headless initialization | +| `npx -y gsdd-cli update --templates` | Regenerate installed runtime surfaces and templates | +| `npx -y gsdd-cli models` | Inspect or set model profiles | +| `npx -y gsdd-cli rigor` | Set workflow rigor defaults | +| `npx -y gsdd-cli health` | Check workspace integrity and generated-surface freshness | +| `npx -y gsdd-cli ui-proof validate` | Validate UI proof metadata | +| `npx -y gsdd-cli ui-proof compare` | Compare planned UI proof slots to observed bundles | +| `npx -y gsdd-cli control-map` | Show repo, worktree, and planning state | +| `npx -y gsdd-cli closeout-report` | Replay closeout blockers, warnings, and next action | +| `npx -y gsdd-cli find-phase` | Resolve a phase number or title | +| `npx -y gsdd-cli phase-status` | Update ROADMAP phase status deterministically | +| `npx -y gsdd-cli verify` | Run direct phase verification helpers | +| `npx -y gsdd-cli scaffold` | Scaffold planning artifacts for tests or fixtures | +| `npx -y gsdd-cli session-fingerprint` | Compute a local session fingerprint | +| `npx -y gsdd-cli file-op` | Run deterministic file copy/delete helpers used by generated workflows | +| `npx -y gsdd-cli help` | Show CLI help | + +`ui-proof validate` and `ui-proof compare` also understand optional browser runtime capture annotations, so plans can record provider choice, screenshot/snapshot modes, budgets, and fallback reasons without installing browser tooling by default. + +Full reference: [User Guide](docs/USER-GUIDE.md), [Runtime Support](docs/RUNTIME-SUPPORT.md), [Verification Discipline](docs/VERIFICATION-DISCIPLINE.md). + +## Troubleshooting + +Start with: + +```bash +npx -y gsdd-cli health +``` + +If health reports stale generated surfaces, run `npx -y gsdd-cli update`. For command usage and recovery examples, see the [User Guide](docs/USER-GUIDE.md). 
## Credits -Fork of [Get Shit Done](https://github.com/gsd-build/get-shit-done) by [Lex Christopherson](https://github.com/glittercowboy), MIT licensed. Original git history retained. +Workspine began as a fork of [Get Shit Done](https://github.com/gsd-build/get-shit-done) by [Lex Christopherson](https://github.com/glittercowboy), MIT licensed. Original git history retained. MIT License. See [LICENSE](LICENSE) for details. diff --git a/agents/executor.md b/agents/executor.md index a908ab6..d585046 100644 --- a/agents/executor.md +++ b/agents/executor.md @@ -222,6 +222,8 @@ Use `agent-browser` as the default live UI proof path: If `agent-browser` is unavailable, record the availability constraint and closest project-native interactive browser fallback in the proof bundle instead of silently treating the fallback as the default path. Existing Playwright/package-script browser tests remain canonical repeatable regression evidence when present; use Playwright scripting only for checks `agent-browser` cannot cover cleanly, such as JS-disabled, structured console, or multi-context verification. +If a planned slot declares `runtime_capture_requirements`, add observed `runtime_capture` metadata with selected provider, fallback chain, availability, capture modes, latency, text/raw byte counts, estimated tokens, screenshot counts, bounded computed-style/property counts, fidelity limits, and artifact refs. Direct-CDP is an escalation for selected DOM/CSS/computed-style, console, network, or framework-state proof; Chrome DevTools MCP and Playwright MCP are optional only when already configured and scoped. Keep raw screenshots, traces, videos, DOM, reports, console/network logs, and framework state local-only or summarized unless explicitly sanitized. 
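As a rough illustration of the `runtime_capture` metadata described above, the following sketch builds a hypothetical observed capture record and checks summed costs against planned budgets. The budget-to-metric pairs are copied from `RUNTIME_CAPTURE_BUDGET_FIELD_MAP` in `bin/lib/ui-proof.mjs`. Everything else is an assumption made for illustration: the helper names `sumMetric` and `findOverBudget`, the sample numbers, and the choice to put metric fields directly on each capture entry. It is not the comparison logic that `gsdd ui-proof compare` actually runs.

```javascript
// Budget-name -> metric-name pairs, copied from RUNTIME_CAPTURE_BUDGET_FIELD_MAP
// in bin/lib/ui-proof.mjs.
const BUDGET_FIELD_MAP = Object.freeze({
  text_bytes_max: 'text_bytes',
  estimated_tokens_max: 'estimated_tokens',
  raw_artifact_bytes_max: 'raw_bytes',
  screenshot_count_max: 'screenshot_count',
  computed_style_properties_max: 'computed_style_properties',
  console_event_count_max: 'console_event_count',
  network_event_count_max: 'network_event_count',
});

// Hypothetical observed metadata. Provider ids and numbers are illustrative,
// and placing metrics directly on each capture entry is an assumption.
const runtimeCapture = {
  provider: {
    primary: 'agent-browser',
    selected: 'agent-browser',
    fallback_chain: ['agent-browser', 'direct-cdp'],
    availability: [{ provider: 'agent-browser', status: 'available' }],
  },
  captures: [
    { mode: 'screenshot', latency_ms: 420, raw_bytes: 180000, screenshot_count: 1 },
    { mode: 'computed_style', latency_ms: 35, text_bytes: 900, estimated_tokens: 240, computed_style_properties: 12 },
  ],
};

// Sum one numeric metric across all capture entries, skipping entries that omit it.
function sumMetric(captures, metric) {
  return captures.reduce(
    (total, capture) => total + (typeof capture[metric] === 'number' ? capture[metric] : 0),
    0,
  );
}

// Return one entry per planned budget whose summed metric exceeds its limit.
function findOverBudget(captures, budgets) {
  const exceeded = [];
  for (const [budgetField, limit] of Object.entries(budgets)) {
    // Unknown budget names are skipped in this sketch rather than reported.
    if (!Object.prototype.hasOwnProperty.call(BUDGET_FIELD_MAP, budgetField)) continue;
    const metric = BUDGET_FIELD_MAP[budgetField];
    const total = sumMetric(captures, metric);
    if (total > limit) exceeded.push({ budgetField, metric, total, limit });
  }
  return exceeded;
}

// Example planned budgets: estimated_tokens_max is exceeded here (240 > 200),
// while screenshot_count_max is not (1 <= 2).
const plannedBudgets = { estimated_tokens_max: 200, screenshot_count_max: 2 };
console.log(findOverBudget(runtimeCapture.captures, plannedBudgets));
```

Summing per-capture costs before comparing keeps the budget a property of the whole slot rather than of any single capture. Whether the real validator aggregates this way is not shown in this diff.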
+ Artifact metadata must include `visibility`, `retention`, `sensitivity`, and `safe_to_publish`; raw screenshots, traces, videos, DOM snapshots, and reports are local-only/unsafe by default and cannot back public, tracked, delivery, release, or publication proof claims. Use `gsdd ui-proof validate ` or `gsdd health` when a bundle exists. Artifact count, source comments, AST/cAST findings, semantic search, and Semble-like retrieval are not proof. Missing or weakly linked evidence must be recorded as proof debt, waiver, deferment, or reduced claim language rather than satisfied proof. diff --git a/agents/planner.md b/agents/planner.md index c8e41fa..fcf2d11 100644 --- a/agents/planner.md +++ b/agents/planner.md @@ -160,6 +160,8 @@ Require observed artifacts to carry `visibility`, `retention`, `sensitivity`, an For live rendered UI proof, plan `agent-browser` as the default runtime evidence path: route open, interactive snapshot/refs when relevant, changed-flow interaction, screenshots for the planned viewport(s), and relevant console/network observations. If `agent-browser` is unavailable in the runtime, require an explicit availability constraint and the closest project-native interactive browser fallback before narrowing the claim. Existing Playwright/package-script browser tests remain the canonical repeatable regression path when present; do not scaffold new browser infrastructure by default. The planner chooses viewport coverage, but must explain why the viewport set is sufficient for the claim or narrow the claim limit; responsive claims need desktop/mobile or equivalent state coverage. +When the UI claim needs provider choice, capture fidelity, or token/artifact cost to be benchmarked, add optional `runtime_capture_requirements` with provider preference, fallback policy, required/optional modes, and budgets. 
Keep `agent-browser` first; direct-CDP is only an escalation for selected DOM/CSS/computed-style, console, network, or framework-state claims; Chrome DevTools MCP and Playwright MCP are optional only when already configured. Do not plan browser installs, browser MCP, CI, Storybook, or visual-regression infrastructure for this metadata. + Do not let source annotations, AST/cAST findings, semantic search, comments, or Semble-like retrieval satisfy proof slots; they are discovery hints only. Human acceptance can narrow or waive a claim and record proof debt, but it must not turn missing or mismatched non-human evidence into `satisfied` proof. diff --git a/agents/verifier.md b/agents/verifier.md index cb73bed..3643326 100644 --- a/agents/verifier.md +++ b/agents/verifier.md @@ -123,7 +123,7 @@ Do not return a flat symptom list when the same underlying breakage explains mul Visual correctness, live interaction quality, and some external integrations still need explicit human checks. -For UI proof slots, fail closed unless observed proof is matched to the exact claim, route/state, observation, evidence kind, artifact path or manual step, privacy metadata, result, and claim limit. For live UI runtime proof, expect `agent-browser` as the default captured tool unless the observed bundle explains a project-native equivalent or an availability constraint; do not fail solely because another browser tool was used, but downgrade vague proof that lacks exact route/state, viewport coverage or rationale, interactive steps/refs where relevant, screenshot/report artifacts, or relevant console/network observations. Existing Playwright/package-script browser tests count as canonical repeatable regression evidence, not as a replacement for scoped runtime proof when the slot requires `runtime`. 
Artifact metadata must include `visibility`, `retention`, `sensitivity`, and `safe_to_publish`; local-only or unsafe artifacts cannot back public, tracked, delivery, release, or publication proof claims, and `gsdd ui-proof validate`/`gsdd health` metadata failures block the stronger proof claim. Screenshots, traces, reports, Gherkin, a11y scans, E2E outputs, manual notes, source annotations, AST/cAST findings, semantic search, comments, and Semble-like retrieval do not satisfy proof by existence alone. Human acceptance records risk, waiver, deferment, proof debt, or a narrowed claim; it does not upgrade missing or mismatched non-human proof to `satisfied`. +For UI proof slots, fail closed unless observed proof is matched to the exact claim, route/state, observation, evidence kind, artifact path or manual step, privacy metadata, result, and claim limit. If a planned slot includes `runtime_capture_requirements`, require linked observed `runtime_capture` metadata with required passed modes, selected provider, fallback rationale when selected provider differs from preference, and budget totals within the planned limits; missing capture metadata, missing modes, over-budget captures, or unexplained fallback remain `partial` or `missing`. For live UI runtime proof, expect `agent-browser` as the default captured tool unless the observed bundle explains a project-native equivalent or an availability constraint; do not fail solely because another browser tool was used, but downgrade vague proof that lacks exact route/state, viewport coverage or rationale, interactive steps/refs where relevant, screenshot/report artifacts, or relevant console/network observations. Treat direct-CDP as escalation for selected DOM/CSS/computed-style, console, network, or framework-state claims, and count Chrome DevTools MCP or Playwright MCP only when already configured and scoped. 
Existing Playwright/package-script browser tests count as canonical repeatable regression evidence, not as a replacement for scoped runtime proof when the slot requires `runtime`. Artifact metadata must include `visibility`, `retention`, `sensitivity`, and `safe_to_publish`; local-only or unsafe artifacts cannot back public, tracked, delivery, release, or publication proof claims, and `gsdd ui-proof validate`/`gsdd health` metadata failures block the stronger proof claim. Screenshots, traces, reports, Gherkin, a11y scans, E2E outputs, manual notes, source annotations, AST/cAST findings, semantic search, comments, and Semble-like retrieval do not satisfy proof by existence alone. Human acceptance records risk, waiver, deferment, proof debt, or a narrowed claim; it does not upgrade missing or mismatched non-human proof to `satisfied`. ## Step 9: Determine overall status diff --git a/bin/gsdd.mjs b/bin/gsdd.mjs index 1972ff2..3a4e234 100644 --- a/bin/gsdd.mjs +++ b/bin/gsdd.mjs @@ -28,7 +28,6 @@ const DISTILLED_DIR = join(__dirname, '..', 'distilled'); const AGENTS_DIR = join(__dirname, '..', 'agents'); const PACKAGE_JSON = JSON.parse(readFileSync(join(__dirname, '..', 'package.json'), 'utf-8')); const IS_MAIN = process.argv[1] ? 
realpathSync(process.argv[1]) === realpathSync(__filename) : false; - const [,, command, ...args] = process.argv; function defineWorkflow({ mutatesArtifacts = true, ...workflow }) { diff --git a/bin/lib/health.mjs b/bin/lib/health.mjs index 4e20b77..f62f9c6 100644 --- a/bin/lib/health.mjs +++ b/bin/lib/health.mjs @@ -151,7 +151,7 @@ export function buildHealthReport(ctx, healthArgs = []) { id: 'E10', severity: 'ERROR', message: `${relativePath} has invalid UI proof metadata (${validation.errors.map((entry) => entry.code).join(', ')})`, - fix: 'Run `gsdd ui-proof validate ` and add required privacy metadata, claim limits, fixed evidence kinds, concise tool provenance, failure classification when failed or partial, observation artifact references, existing local artifact paths, and safe-to-publish handling.', + fix: 'Run `gsdd ui-proof validate ` and add required privacy metadata, claim limits, fixed evidence kinds, concise tool provenance, runtime capture benchmark fields when present, failure classification when failed or partial, observation artifact references, existing local artifact paths, and safe-to-publish handling.', }); } } diff --git a/bin/lib/init-runtime.mjs b/bin/lib/init-runtime.mjs index fffba4f..f1a4ad7 100644 --- a/bin/lib/init-runtime.mjs +++ b/bin/lib/init-runtime.mjs @@ -189,9 +189,9 @@ Commands: session-fingerprint write [--allow-changed ] Rebaseline planning-state drift after reviewing changed planning files ui-proof validate [--claim ] - Validate UI proof metadata; use --claim for stronger proof uses + Validate UI proof metadata, including optional runtime capture annotations ui-proof compare [observed-bundle-json ...] 
- Compare planned UI proof slots against observed bundles + Compare planned UI proof slots and runtime capture requirements against observed bundles control-map [--json] [--with-ignored] [--annotations ] Report computed repo/worktree/planning state and local annotations control-map annotate @@ -270,7 +270,7 @@ Advanced/internal helpers (kept available, but not the primary first-run user st lifecycle-preflight Inspect deterministic lifecycle gate results for a workflow surface session-fingerprint Rebaseline the local planning-state fingerprint after review phase-status Update ROADMAP.md phase status through the local helper surface - ui-proof Validate UI proof metadata and compare planned slots to observed bundles + ui-proof Validate UI proof metadata and compare planned slots/capture requirements to observed bundles control-map Report computed repo/worktree/planning state; annotate only records local intent closeout-report Read-only post-merge closure replay; reports blockers, warnings, and next safe action file-op Deterministic workspace-confined file copy/delete/text mutation diff --git a/bin/lib/rendering.mjs b/bin/lib/rendering.mjs index 9ec076a..9d4fdca 100644 --- a/bin/lib/rendering.mjs +++ b/bin/lib/rendering.mjs @@ -90,9 +90,9 @@ function printHelp() { ' session-fingerprint write [--allow-changed ]', ' Rebaseline planning-state drift after reviewing changed planning files', ' ui-proof validate [--claim ]', - ' Validate UI proof metadata; use --claim for stronger proof uses', + ' Validate UI proof metadata, including optional runtime capture annotations', ' ui-proof compare [observed-bundle-json ...]', - ' Compare planned UI proof slots against observed bundles', + ' Compare planned UI proof slots and runtime capture requirements against observed bundles', ' control-map [--json] [--with-ignored] [--annotations ]', ' Report computed repo/worktree/planning state and local annotations', ' closeout-report [--json] [--phase ]', diff --git a/bin/lib/ui-proof.mjs 
b/bin/lib/ui-proof.mjs index a016882..5d21796 100644 --- a/bin/lib/ui-proof.mjs +++ b/bin/lib/ui-proof.mjs @@ -11,6 +11,38 @@ const RAW_ARTIFACT_TYPES = Object.freeze(['screenshot', 'trace', 'video', 'dom_s const PUBLIC_CLAIM_USES = Object.freeze(['public', 'publication', 'tracked', 'delivery', 'release']); const CLAIM_USES = Object.freeze([...PUBLIC_CLAIM_USES, 'local', 'local_only']); const FAILURE_CLASSIFICATIONS = Object.freeze(['product_bug', 'missing_infra', 'flaky_harness', 'ambiguous_spec']); +const RUNTIME_CAPTURE_MODES = Object.freeze([ + 'screenshot', + 'interactive_snapshot', + 'accessibility_snapshot', + 'dom_subset', + 'selected_element_dom', + 'computed_style', + 'console_delta', + 'network_delta', + 'framework_state', + 'manual_observation', +]); +const RUNTIME_CAPTURE_AVAILABILITY_STATUSES = Object.freeze(['available', 'unavailable', 'not_configured', 'skipped', 'failed']); +const RUNTIME_CAPTURE_METRIC_FIELDS = Object.freeze([ + 'latency_ms', + 'raw_bytes', + 'text_bytes', + 'estimated_tokens', + 'screenshot_count', + 'computed_style_properties', + 'console_event_count', + 'network_event_count', +]); +const RUNTIME_CAPTURE_BUDGET_FIELD_MAP = Object.freeze({ + text_bytes_max: 'text_bytes', + estimated_tokens_max: 'estimated_tokens', + raw_artifact_bytes_max: 'raw_bytes', + screenshot_count_max: 'screenshot_count', + computed_style_properties_max: 'computed_style_properties', + console_event_count_max: 'console_event_count', + network_event_count_max: 'network_event_count', +}); const TOOL_ID_PATTERN = /^[a-z0-9][a-z0-9_.:-]*$/; const REQUIRED_BUNDLE_FIELDS = Object.freeze([ 'proof_bundle_version', @@ -265,6 +297,63 @@ function validateClaimLimits(bundle, errors) { } } +function validateRuntimeCaptureProviderId(value, path, errors) { + if (!hasValue(value)) return; + if (typeof value !== 'string' || !TOOL_ID_PATTERN.test(value)) { + addError(errors, 'invalid_runtime_capture_provider_id', path, `Invalid runtime capture provider identifier: ${value}`, 
'Use a concise lowercase provider identifier without spaces; do not encode provider-specific schema in the validator.'); + } +} + +function validateNonNegativeNumber(value, path, errors, code, label) { + if (!hasValue(value)) return; + if (typeof value !== 'number' || !Number.isFinite(value) || value < 0) { + addError(errors, code, path, `${label} must be a non-negative number.`, 'Record numeric runtime capture costs as non-negative numbers, or omit the field when unknown.'); + } +} + +function validateRuntimeCaptureModes(values, path, errors) { + for (const [index, mode] of normalizeArray(values).entries()) { + if (!RUNTIME_CAPTURE_MODES.includes(mode)) { + addError(errors, 'unsupported_runtime_capture_mode', `${path}[${index}]`, `Unsupported runtime capture mode: ${mode}`, `Use only: ${RUNTIME_CAPTURE_MODES.join(', ')}.`); + } + } +} + +function runtimeCaptureRequirements(slot) { + return slot?.runtime_capture_requirements || slot?.runtimeCaptureRequirements; +} + +function validateRuntimeCaptureRequirements(requirements, path, errors) { + if (!hasValue(requirements)) return; + if (!isPlainObject(requirements)) { + addError(errors, 'invalid_runtime_capture_requirements', path, 'Runtime capture requirements must be an object.', 'Record provider preferences, capture modes, and budgets as structured metadata.'); + return; + } + + for (const [index, provider] of normalizeArray(requirements.provider_preference || requirements.providerPreference).entries()) { + validateRuntimeCaptureProviderId(provider, `${path}.provider_preference[${index}]`, errors); + } + if (hasValue(requirements.fallback_policy || requirements.fallbackPolicy) && typeof (requirements.fallback_policy || requirements.fallbackPolicy) !== 'string') { + addError(errors, 'invalid_runtime_capture_fallback_policy', `${path}.fallback_policy`, 'Runtime capture fallback policy must be a string.', 'Record a concise fallback policy such as record_availability_and_narrow_claim.'); + } + 
validateRuntimeCaptureModes(requirements.required_modes || requirements.requiredModes, `${path}.required_modes`, errors); + validateRuntimeCaptureModes(requirements.optional_modes || requirements.optionalModes, `${path}.optional_modes`, errors); + + const budgets = requirements.budgets; + if (!hasValue(budgets)) return; + if (!isPlainObject(budgets)) { + addError(errors, 'invalid_runtime_capture_budget', `${path}.budgets`, 'Runtime capture budgets must be an object.', 'Record runtime capture budgets as named non-negative numeric limits.'); + return; + } + for (const [field, value] of Object.entries(budgets)) { + if (!Object.prototype.hasOwnProperty.call(RUNTIME_CAPTURE_BUDGET_FIELD_MAP, field)) { + addError(errors, 'unsupported_runtime_capture_budget', `${path}.budgets.${field}`, `Unsupported runtime capture budget field: ${field}`, `Use only: ${Object.keys(RUNTIME_CAPTURE_BUDGET_FIELD_MAP).join(', ')}.`); + continue; + } + validateNonNegativeNumber(value, `${path}.budgets.${field}`, errors, 'invalid_runtime_capture_budget', 'Runtime capture budget'); + } +} + function artifactReference(artifact) { if (!isPlainObject(artifact)) return null; if (typeof artifact.path === 'string' && artifact.path.trim()) return artifact.path.trim(); @@ -366,6 +455,7 @@ export function validateUiProofSlots(slots) { if (normalizeArray(slot.expected_artifact_types || slot.expectedArtifactTypes).length === 0) { addError(errors, 'missing_expected_artifact_types', `${slotPath}.expected_artifact_types`, 'Planned UI proof slot must include expected artifact types.', 'List expected artifact types such as screenshot, trace, report, or dom_snapshot.'); } + validateRuntimeCaptureRequirements(runtimeCaptureRequirements(slot), `${slotPath}.runtime_capture_requirements`, errors); } return { valid: errors.length === 0, errors, warnings: [] }; @@ -399,6 +489,121 @@ function validateObservationArtifactRefs(bundle, artifactRefs, errors) { } } +function validateRuntimeCaptureProvider(provider, path, 
errors) { + if (!hasValue(provider)) return; + if (!isPlainObject(provider)) { + addError(errors, 'invalid_runtime_capture_provider', path, 'runtime_capture.provider must be an object.', 'Record selected provider, fallback chain, and availability as structured metadata.'); + return; + } + + validateRuntimeCaptureProviderId(provider.primary, `${path}.primary`, errors); + validateRuntimeCaptureProviderId(provider.selected, `${path}.selected`, errors); + + const fallbackChain = provider.fallback_chain || provider.fallbackChain; + if (hasValue(fallbackChain) && !Array.isArray(fallbackChain)) { + addError(errors, 'invalid_runtime_capture_fallback_chain', `${path}.fallback_chain`, 'runtime_capture.provider.fallback_chain must be an array.', 'Record provider fallback order as concise provider identifiers.'); + } + for (const [index, providerId] of normalizeArray(fallbackChain).entries()) { + validateRuntimeCaptureProviderId(providerId, `${path}.fallback_chain[${index}]`, errors); + } + + if (hasValue(provider.fallback_reason || provider.fallbackReason) && typeof (provider.fallback_reason || provider.fallbackReason) !== 'string') { + addError(errors, 'invalid_runtime_capture_fallback_reason', `${path}.fallback_reason`, 'runtime_capture.provider.fallback_reason must be a string.', 'Record why the selected provider differs from the preferred/default provider.'); + } + + const availability = provider.availability; + if (hasValue(availability) && !Array.isArray(availability)) { + addError(errors, 'invalid_runtime_capture_availability', `${path}.availability`, 'runtime_capture.provider.availability must be an array.', 'Record provider availability entries as objects with provider and status.'); + return; + } + for (const [index, entry] of normalizeArray(availability).entries()) { + const entryPath = `${path}.availability[${index}]`; + if (!isPlainObject(entry)) { + addError(errors, 'invalid_runtime_capture_availability', entryPath, 'Runtime capture provider availability entry 
must be an object.', 'Record provider availability entries as objects with provider and status.'); + continue; + } + validateRuntimeCaptureProviderId(entry.provider, `${entryPath}.provider`, errors); + if (!RUNTIME_CAPTURE_AVAILABILITY_STATUSES.includes(entry.status)) { + addError(errors, 'invalid_runtime_capture_availability_status', `${entryPath}.status`, `Invalid runtime capture provider availability status: ${entry.status}`, `Use only: ${RUNTIME_CAPTURE_AVAILABILITY_STATUSES.join(', ')}.`); + } + } +} + +function validateRuntimeCapture(bundle, artifactRefs, errors) { + const runtimeCapture = bundle?.runtime_capture || bundle?.runtimeCapture; + if (!hasValue(runtimeCapture)) return; + if (!isPlainObject(runtimeCapture)) { + addError(errors, 'invalid_runtime_capture', 'runtime_capture', 'runtime_capture must be an object.', 'Record runtime capture provider, captures, fidelity, and budget metadata as structured JSON.'); + return; + } + + validateRuntimeCaptureProvider(runtimeCapture.provider, 'runtime_capture.provider', errors); + + const captures = runtimeCapture.captures; + if (!Array.isArray(captures) || captures.length === 0) { + addError(errors, 'missing_runtime_capture_captures', 'runtime_capture.captures', 'runtime_capture must include at least one capture entry.', 'Record claim-scoped capture entries such as screenshot, interactive_snapshot, or computed_style.'); + return; + } + + const declaredSlotIds = new Set(normalizeArray(bundle?.scope?.slot_ids)); + for (const [index, capture] of captures.entries()) { + const capturePath = `runtime_capture.captures[${index}]`; + if (!isPlainObject(capture)) { + addError(errors, 'invalid_runtime_capture_capture', capturePath, 'Runtime capture entry must be an object.', 'Record capture mode, slot IDs, result, and cost metadata as an object.'); + continue; + } + + if (!hasValue(capture.mode)) { + addError(errors, 'missing_runtime_capture_mode', `${capturePath}.mode`, 'Runtime capture entry must include mode.', `Use one 
of: ${RUNTIME_CAPTURE_MODES.join(', ')}.`); + } else if (!RUNTIME_CAPTURE_MODES.includes(capture.mode)) { + addError(errors, 'unsupported_runtime_capture_mode', `${capturePath}.mode`, `Unsupported runtime capture mode: ${capture.mode}`, `Use only: ${RUNTIME_CAPTURE_MODES.join(', ')}.`); + } + + validateRuntimeCaptureProviderId(capture.provider, `${capturePath}.provider`, errors); + + const captureSlotIds = normalizeArray(capture.slot_ids || capture.slotIds); + if (captureSlotIds.length === 0) { + addError(errors, 'missing_runtime_capture_slot_ids', `${capturePath}.slot_ids`, 'Runtime capture entry must declare the slot IDs it supports.', 'Attach each capture to the planned UI proof slot IDs it supports.'); + } + for (const [slotIndex, captureSlotId] of captureSlotIds.entries()) { + if (declaredSlotIds.size > 0 && !declaredSlotIds.has(captureSlotId)) { + addError(errors, 'unknown_runtime_capture_slot', `${capturePath}.slot_ids[${slotIndex}]`, `Runtime capture references undeclared slot: ${captureSlotId}`, 'Use only slot IDs declared in scope.slot_ids.'); + } + } + + if (!hasValue(capture.result)) { + addError(errors, 'missing_runtime_capture_result', `${capturePath}.result`, 'Runtime capture entry must include result.', `Record result using: ${CLAIM_STATUSES.join(', ')}.`); + } else if (!CLAIM_STATUSES.includes(capture.result)) { + addError(errors, 'invalid_runtime_capture_result', `${capturePath}.result`, `Invalid runtime capture result: ${capture.result}`, `Use only: ${CLAIM_STATUSES.join(', ')}.`); + } + + for (const field of RUNTIME_CAPTURE_METRIC_FIELDS) { + validateNonNegativeNumber(capture[field], `${capturePath}.${field}`, errors, 'invalid_runtime_capture_metric', 'Runtime capture metric'); + } + if (hasValue(capture.token_estimate_method) && typeof capture.token_estimate_method !== 'string') { + addError(errors, 'invalid_runtime_capture_token_estimate_method', `${capturePath}.token_estimate_method`, 'Runtime capture token_estimate_method must be a string.', 
'Record the token estimation method as concise text, or omit it when unknown.'); + } + + for (const [refIndex, ref] of normalizeArray(capture.artifact_refs || capture.artifactRefs).entries()) { + if (!artifactRefs.has(ref)) { + addError(errors, 'unknown_runtime_capture_artifact_ref', `${capturePath}.artifact_refs[${refIndex}]`, `Runtime capture references undeclared UI proof artifact: ${ref}`, 'Add the artifact to artifacts[] or correct the runtime capture artifact reference.'); + } + } + } + + const fidelity = runtimeCapture.fidelity; + if (!hasValue(fidelity)) return; + if (!isPlainObject(fidelity)) { + addError(errors, 'invalid_runtime_capture_fidelity', 'runtime_capture.fidelity', 'runtime_capture.fidelity must be an object.', 'Record runtime capture fidelity flags as structured metadata.'); + return; + } + for (const field of ['sees_pixels', 'includes_accessibility_tree', 'includes_dom_subset', 'includes_computed_styles', 'includes_framework_state']) { + if (hasValue(fidelity[field]) && typeof fidelity[field] !== 'boolean') { + addError(errors, 'invalid_runtime_capture_fidelity_flag', `runtime_capture.fidelity.${field}`, 'Runtime capture fidelity flags must be boolean.', 'Use true or false for fidelity capability flags.'); + } + } +} + function stableString(value) { return JSON.stringify(canonicalize(value)); } @@ -469,6 +674,11 @@ function comparisonFixHint(code) { missing_claim_limit: 'Preserve the planned claim limit in the observed proof bundle.', missing_expected_artifact_type: 'Attach the planned artifact type, such as screenshot, report, trace, or DOM snapshot.', missing_observed_bundle: 'Create an observed UI proof bundle for the planned slot, or explicitly waive/defer the slot with claim narrowing.', + missing_runtime_capture: 'Add observed runtime_capture metadata for the planned browser capture requirements, or narrow/defer the capture claim.', + missing_runtime_capture_mode: 'Capture the required browser evidence mode for this slot, or remove it 
from the planned runtime capture requirements.', + runtime_capture_budget_exceeded: 'Reduce the captured text/artifact scope, split the proof, or raise the planned budget with rationale before claiming the slot is satisfied.', + runtime_capture_fallback_missing_reason: 'Record why the selected browser provider differed from the planned preference and narrow the claim as needed.', + missing_runtime_capture_provider: 'Record the runtime capture provider selected for this proof bundle.', }; return hints[code] || 'Fix the proof issue, rerun the comparison, and keep the slot partial until evidence matches the plan.'; } @@ -481,6 +691,101 @@ function decorateComparisonIssue(issue) { }; } +function captureSlotIds(capture) { + return normalizeArray(capture?.slot_ids || capture?.slotIds); +} + +function slotCaptures(bundle, slotIdValue) { + return normalizeArray(bundle?.runtime_capture?.captures || bundle?.runtimeCapture?.captures) + .filter(isPlainObject) + .filter((capture) => captureSlotIds(capture).includes(slotIdValue)); +} + +function runtimeCaptureProvider(bundle) { + return bundle?.runtime_capture?.provider || bundle?.runtimeCapture?.provider; +} + +function runtimeCaptureProviderSelected(bundle) { + const provider = runtimeCaptureProvider(bundle); + return provider?.selected; +} + +function runtimeCaptureFallbackReason(bundle) { + const provider = runtimeCaptureProvider(bundle); + return provider?.fallback_reason || provider?.fallbackReason; +} + +function runtimeCaptureBudgetTotal(captures, budgetField) { + const metric = RUNTIME_CAPTURE_BUDGET_FIELD_MAP[budgetField]; + if (!metric) return 0; + if (metric === 'screenshot_count') { + const explicitCount = captures.reduce((sum, capture) => sum + (typeof capture.screenshot_count === 'number' ? capture.screenshot_count : 0), 0); + return explicitCount > 0 ? 
explicitCount : captures.filter((capture) => capture.mode === 'screenshot').length; + } + return captures.reduce((sum, capture) => sum + (typeof capture[metric] === 'number' ? capture[metric] : 0), 0); +} + +function compareRuntimeCaptureRequirements(slot, slotIdValue, bundle, issues) { + const requirements = runtimeCaptureRequirements(slot); + if (!hasValue(requirements)) return; + + const captures = slotCaptures(bundle, slotIdValue); + if (!hasValue(bundle?.runtime_capture || bundle?.runtimeCapture) || captures.length === 0) { + issues.push({ + code: 'missing_runtime_capture', + path: 'runtime_capture', + message: `Observed UI proof for slot ${slotIdValue} is missing runtime_capture metadata linked to the slot.`, + }); + return; + } + + const passedModes = new Set(captures.filter((capture) => capture.result === 'passed').map((capture) => capture.mode).filter(Boolean)); + for (const mode of normalizeArray(requirements.required_modes || requirements.requiredModes)) { + if (!passedModes.has(mode)) { + issues.push({ + code: 'missing_runtime_capture_mode', + path: 'runtime_capture.captures[].mode', + message: `Observed UI proof for slot ${slotIdValue} is missing required runtime capture mode: ${mode}.`, + }); + } + } + + const budgets = requirements.budgets; + if (isPlainObject(budgets)) { + for (const [field, max] of Object.entries(budgets)) { + if (!Object.prototype.hasOwnProperty.call(RUNTIME_CAPTURE_BUDGET_FIELD_MAP, field)) continue; + if (typeof max !== 'number' || !Number.isFinite(max)) continue; + const total = runtimeCaptureBudgetTotal(captures, field); + if (total > max) { + issues.push({ + code: 'runtime_capture_budget_exceeded', + path: `runtime_capture.captures.${field}`, + message: `Observed UI proof for slot ${slotIdValue} exceeds runtime capture budget ${field}: ${total} > ${max}.`, + }); + } + } + } + + const providerPreference = normalizeArray(requirements.provider_preference || requirements.providerPreference); + if (providerPreference.length === 0) 
return; + const selected = runtimeCaptureProviderSelected(bundle); + if (!selected) { + issues.push({ + code: 'missing_runtime_capture_provider', + path: 'runtime_capture.provider.selected', + message: `Observed UI proof for slot ${slotIdValue} does not record the selected runtime capture provider.`, + }); + return; + } + if (!providerPreference.includes(selected) && !hasValue(runtimeCaptureFallbackReason(bundle))) { + issues.push({ + code: 'runtime_capture_fallback_missing_reason', + path: 'runtime_capture.provider.fallback_reason', + message: `Observed UI proof for slot ${slotIdValue} selected runtime capture provider ${selected} outside the planned preference without fallback rationale.`, + }); + } +} + function compareSlotToBundle(slot, slotIdValue, observed) { const issues = []; const bundle = observed.bundle; @@ -682,6 +987,8 @@ function compareSlotToBundle(slot, slotIdValue, observed) { }); } + compareRuntimeCaptureRequirements(slot, slotIdValue, bundle, issues); + const status = issues.length === 0 ? 'satisfied' : (bundleStatus === 'missing' ? 
'missing' : 'partial'); return { status, issues: issues.map(decorateComparisonIssue), source: observed.source }; } @@ -769,6 +1076,7 @@ export function validateUiProofBundle(bundle, options = {}) { validatePublicObservationPrivacy(bundle, errors, publicClaim); const artifactRefs = validateArtifacts(bundle, errors, publicClaim, options); validateObservationArtifactRefs(bundle, artifactRefs, errors); + validateRuntimeCapture(bundle, artifactRefs, errors); return { valid: errors.length === 0, errors, warnings }; } @@ -1004,4 +1312,8 @@ export { COMPARISON_STATUSES as UI_PROOF_COMPARISON_STATUSES, EVIDENCE_KINDS as UI_PROOF_EVIDENCE_KINDS, RAW_ARTIFACT_TYPES as UI_PROOF_RAW_ARTIFACT_TYPES, + RUNTIME_CAPTURE_AVAILABILITY_STATUSES as UI_PROOF_RUNTIME_CAPTURE_AVAILABILITY_STATUSES, + RUNTIME_CAPTURE_BUDGET_FIELD_MAP as UI_PROOF_RUNTIME_CAPTURE_BUDGET_FIELD_MAP, + RUNTIME_CAPTURE_METRIC_FIELDS as UI_PROOF_RUNTIME_CAPTURE_METRIC_FIELDS, + RUNTIME_CAPTURE_MODES as UI_PROOF_RUNTIME_CAPTURE_MODES, }; diff --git a/distilled/DESIGN.md b/distilled/DESIGN.md index b3fd95c..a54f3fa 100644 --- a/distilled/DESIGN.md +++ b/distilled/DESIGN.md @@ -2818,11 +2818,12 @@ Posture compatibility is part of that closeout contract: `repo_closeout` and `ru ## D62 - Repo-Native UI Proof Contract -**Decision (2026-04-28; revised 2026-05-09):** UI-sensitive work should carry a compact planned proof-slot contract and, when executed, an observed UI proof bundle that references artifacts by path or link while preserving the existing closure evidence kinds: `code`, `test`, `runtime`, `delivery`, and `human`. For live rendered UI proof, `agent-browser` is the default runtime evidence path for consumers, while existing Playwright tests remain the canonical repeatable browser-regression path when present. 
The deterministic `ui-proof` validator remains provider-agnostic structural validation, but it now validates planned slot specificity, concise tool provenance, local artifact path existence when validating from files, raw-artifact safety for paths and URLs, and failed/partial proof classification so the workflow cannot degrade back into unstructured "looks good" review. Direct phase verification also treats plan frontmatter as the UI-proof declaration authority and fails closed on missing phase prerequisites, empty `ui_proof_slots: []` without `no_ui_proof_rationale`, and invalid required UI proof. +**Decision (2026-04-28; revised 2026-05-09 and 2026-06-08):** UI-sensitive work should carry a compact planned proof-slot contract and, when executed, an observed UI proof bundle that references artifacts by path or link while preserving the existing closure evidence kinds: `code`, `test`, `runtime`, `delivery`, and `human`. For live rendered UI proof, `agent-browser` is the default runtime evidence path for consumers, while existing Playwright tests remain the canonical repeatable browser-regression path when present. The deterministic `ui-proof` validator remains provider-agnostic structural validation, but it now validates planned slot specificity, optional runtime capture benchmark annotations, concise tool provenance, local artifact path existence when validating from files, raw-artifact safety for paths and URLs, and failed/partial proof classification so the workflow cannot degrade back into unstructured "looks good" review. Direct phase verification also treats plan frontmatter as the UI-proof declaration authority and fails closed on missing phase prerequisites, empty `ui_proof_slots: []` without `no_ui_proof_rationale`, and invalid required UI proof. **Context:** - UI proof targets the recurring failure mode where agents claim a UI works or looks good without rendered proof, matched observations, or explicit human judgment. 
- The contract defines proof slots, proof bundles, comparison statuses, fail-closed agent guardrails, deterministic metadata validation, privacy metadata, and health visibility without adding a browser-provider framework. +- Browser proof benchmark annotations extend this contract by recording provider choice, capture modes, cost, fidelity, and fallback reason as metadata; they do not add a browser sidecar, live direct-CDP implementation, or new evidence kind. - GSD's archived planner, executor, and verifier roles preserve strong lifecycle discipline, but they do not provide this UI-specific planned-vs-observed proof model. GSDD keeps the lifecycle leverage and adds a repo-native UI proof substrate without adding a browser-provider framework. - OneShot's QC guidance and Vercel's `agent-browser` skill converge on an interactive browser loop for snapshots, ref-based interaction, screenshots, and network/console-adjacent inspection. GSDD adapts that as a default workflow instruction, not as a hard validator dependency. @@ -2833,24 +2834,30 @@ Posture compatibility is part of that closeout contract: `repo_closeout` and `ru - When an explicit no-UI rationale exists, stale UI-proof sidecars are warning-level cleanup signals, not proof and not blockers. - Planned slots record claim, route/state, required evidence kinds, minimum observations, expected artifact types, runnable validation command, environment/viewport, manual-acceptance requirement, claim limit, and requirement IDs. - Observed proof bundles record claim, requirement/slot IDs, route/state, environment, viewport, evidence inputs, commands/manual steps, observations, artifacts, privacy metadata, result, and claim limits. +- Planned slots may include optional `runtime_capture_requirements` with provider preference, fallback policy, required/optional capture modes, and budgets when provider choice, fidelity, or capture cost must be benchmarked. 
+- Observed proof bundles may include optional `runtime_capture` metadata with selected provider, fallback chain, availability, captures, metrics, artifact refs, and fidelity limits. Existing bundles without these fields remain valid. +- Runtime capture vocabulary is stable and bounded: modes cover screenshot, interactive/accessibility snapshots, scoped DOM, selected-element DOM, computed style, console/network deltas, framework state, and manual observation; availability statuses are `available`, `unavailable`, `not_configured`, `skipped`, and `failed`; budgets cover text bytes, estimated tokens, raw artifact bytes, screenshot count, computed-style property count, and console/network event counts. - Planned slots must be tight enough for the plan checker to reject vague proof: specific route/state, viewport rationale or narrowed claim limit, minimum observations, expected artifact types, runnable validation, and matchability back to the exact UI claim. - The planner chooses viewport coverage, but responsive or layout-sensitive claims require desktop/mobile or equivalent state coverage unless the claim is explicitly narrowed. - Execution defaults to `agent-browser` for live UI runtime proof: open the route/state, capture interactive snapshots/refs where relevant, exercise the changed flow, capture screenshots for planned viewport(s), and record relevant console/network observations. - Existing Playwright tests or package scripts remain the canonical repeatable browser-regression evidence when present. Playwright scripting is reserved for checks `agent-browser` cannot cover cleanly, such as JS-disabled behavior, structured console listeners, or multi-context testing. +- Direct-CDP is an escalation path for selected DOM/CSS/computed-style, console, network, or framework-state proof, not the default. Chrome DevTools MCP and Playwright MCP are optional only when already configured and scoped to the claim. 
This decision does not add browser tooling, browser installs, CI, Storybook, visual-regression infrastructure, or a live direct-CDP capture implementation. - Verification compares planned slots to observed bundles using `satisfied`, `partial`, `missing`, `waived`, `deferred`, and `not_applicable`; waiver and deferral are not proof. +- Verification compares planned `runtime_capture_requirements` to observed `runtime_capture` only when a slot opts in; missing required modes, missing selected provider, over-budget captures, or unexplained provider fallback keep the slot partial or missing. - UI correctness claims fail closed unless rendered proof is matched exactly to claim, route/state, observation, evidence kind, artifact path or manual step, privacy metadata, result, and claim limit, or an explicit waiver/deferment narrows the claim. - Human acceptance may close a narrowed claim and record proof debt, but it must not convert missing or mismatched non-human evidence into `satisfied` proof. - Screenshots, traces, videos, reports, accessibility scans, Gherkin, and visual diffs are artifact types or activities mapped onto the five existing evidence kinds, not new evidence kinds. - Source annotations, AST/cAST findings, semantic search hits, comments, and Semble-like retrieval may discover proof obligations, but they are discovery hints only and do not satisfy proof slots. - Visual taste, accessibility judgment, baseline acceptance, subjective polish/layout quality, and privacy publication require human evidence or explicit waiver, and human approval does not replace required `code`, `test`, `runtime`, or `delivery` evidence. 
- Deterministic validation keeps the evidence and comparison-status vocabularies unchanged: planned slots require specific claim, route/state, evidence, expected artifacts, validation, viewport, and claim-limit fields; artifact entries require `visibility`, `retention`, `sensitivity`, and `safe_to_publish`; raw screenshots, traces, videos, DOM snapshots, and reports default to `local_only` plus `safe_to_publish: false`; `bin/lib/ui-proof.mjs` validates required bundle/observation fields, structured command/manual-step entries, fixed evidence kinds, concise `tools_used` IDs, claim/result statuses, comparison statuses, failure classification for failed/partial proof, claim limits, privacy metadata, safe artifact references, local artifact path existence when validating file-backed bundles, and public/tracked/delivery proof claims backed by local-only, unsafe, unsanitized, or privacy-contradictory artifacts. +- Raw screenshots, traces, videos, DOM, reports, console/network logs, and framework-state captures remain local-only or summarized by default; validator behavior remains metadata-focused and does not inspect raw pixels, raw DOM, or provider-specific payloads. - `gsdd health` reports invalid known UI proof bundles as E10 using the same validator, staying read-only and avoiding raw artifact content inspection. - Failed UI proof is reported through existing GSDD gap/proof-debt language. Product behavior defects, missing or blocked infrastructure, flaky harnesses, and ambiguous specs explain causes, but they do not add new evidence kinds, result statuses, or comparison statuses. **Leverage:** - Lost: UI-sensitive work now carries a small proof-contract burden, and default live proof guidance adds slightly more specificity for planners/checkers to enforce. 
- Kept: repo-native markdown artifacts, optional project tooling, fixed closure evidence kinds, generated-surface freshness, the plan/execute/verify separation, and provider-agnostic deterministic metadata validation. -- Gained: exact claim-to-proof traceability, strict comparison statuses, privacy and claim-limit metadata, fail-closed overclaim guardrails, deterministic metadata validation, a concrete live browser evidence path, and health-visible protection against unsafe public proof claims. +- Gained: exact claim-to-proof traceability, strict comparison statuses, privacy and claim-limit metadata, fail-closed overclaim guardrails, deterministic metadata validation, a concrete live browser evidence path, benchmarkable browser capture annotations, and health-visible protection against unsafe public proof claims. **Evidence:** - `distilled/templates/ui-proof.md` @@ -2858,6 +2865,7 @@ Posture compatibility is part of that closeout contract: `repo_closeout` and `ru - `agents/planner.md`, `agents/executor.md`, `agents/verifier.md`, `distilled/templates/delegates/plan-checker.md` - `bin/lib/templates.mjs`, `bin/lib/ui-proof.mjs`, `bin/lib/health.mjs`, `bin/lib/phase.mjs`, `bin/lib/rendering.mjs` - `tests/phase.test.cjs`, `tests/gsdd.guards.test.cjs`, `tests/gsdd.health.test.cjs`, `tests/gsdd.init.test.cjs` +- `docs/plans/2026-06-08-001-feat-browser-proof-benchmark-plan.md`, `fixtures/ui-proof/browser-runtime-capture-slots.json`, `fixtures/ui-proof/browser-runtime-capture-bundle.json` - GSD comparison: the upstream planner, executor, and verifier role patterns preserve lifecycle rigor, but they do not define UI proof slots or planned-vs-observed UI proof bundles. 
- OneShot QC source: `https://github.com/oneshot-repo/OneShot/tree/main/skills` - Vercel `agent-browser` docs: `https://github.com/vercel-labs/agent-browser/blob/main/skill-data/core/SKILL.md` and `https://agent-browser.dev/snapshots` diff --git a/distilled/EVIDENCE-INDEX.md b/distilled/EVIDENCE-INDEX.md index 7839dc5..43a827c 100644 --- a/distilled/EVIDENCE-INDEX.md +++ b/distilled/EVIDENCE-INDEX.md @@ -491,6 +491,8 @@ - `agents/planner.md`, `agents/executor.md`, `agents/verifier.md`, `distilled/templates/delegates/plan-checker.md` - `bin/lib/templates.mjs`, `bin/lib/ui-proof.mjs`, `bin/lib/health.mjs`, `bin/lib/rendering.mjs` - `tests/phase.test.cjs`, `tests/gsdd.guards.test.cjs`, `tests/gsdd.health.test.cjs`, `tests/gsdd.init.test.cjs` +- `docs/plans/2026-06-08-001-feat-browser-proof-benchmark-plan.md` +- `fixtures/ui-proof/browser-runtime-capture-slots.json`, `fixtures/ui-proof/browser-runtime-capture-bundle.json` - OneShot QC/browser policy: https://github.com/oneshot-repo/OneShot/tree/main/skills - Vercel agent-browser docs: https://github.com/vercel-labs/agent-browser/blob/main/skill-data/core/SKILL.md, https://agent-browser.dev/snapshots - Playwright browser proof docs: https://playwright.dev/docs/trace-viewer, https://playwright.dev/docs/next/screenshots, https://playwright.dev/mcp/tools/tracing @@ -502,7 +504,7 @@ - Agent-hostile UI semantics source: https://dev.to/ratikkoka/your-ui-is-invisible-to-ai-agents-heres-how-to-fix-it-1ib3 - Browser-harness efficiency source: https://dev.to/louaiboumediene/the-ai-harness-why-your-ai-coding-agent-is-only-as-smart-as-the-repo-you-put-it-in-cml - Dynamic-UI harness brittleness source: https://tessl.io/blog/webmcp-making-web-apps-faster-and-cheaper-for-ai-agents/ -- Long-term pitfalls carried forward: do not accept screenshot-free "looks good" claims, weak planned slots, unverified artifact paths, stale interactive refs after page mutation, partial/failed proof without failure classification, raw artifact 
publication without privacy metadata, browser contention in parallel checks, or semantic/selector-poor UI that forces fragile coordinate inspection. +- Long-term pitfalls carried forward: do not accept screenshot-free "looks good" claims, weak planned slots, unverified artifact paths, stale interactive refs after page mutation, unbudgeted browser snapshots, unexplained provider fallback, partial/failed proof without failure classification, raw artifact publication without privacy metadata, browser contention in parallel checks, or semantic/selector-poor UI that forces fragile coordinate inspection. - Supporting spec/runtime docs: https://openspec.dev/, https://www.lean-spec.dev/docs/guide/first-principles, https://help.openai.com/en/articles/11369540-codex-in-chatgpt, https://docs.claude.com/en/docs/agents-and-tools/agent-skills, https://docs.github.com/en/copilot/concepts/prompting/response-customization ## D63 — Computed-First Control Map diff --git a/distilled/templates/delegates/plan-checker.md b/distilled/templates/delegates/plan-checker.md index 2fbc0e4..8988223 100644 --- a/distilled/templates/delegates/plan-checker.md +++ b/distilled/templates/delegates/plan-checker.md @@ -36,6 +36,7 @@ Verify these dimensions: - `closure_honesty`: the plan's done criteria and evidence limits support only claims that execution can actually prove. - `closure_honesty`: for UI proof, reject agent-only `looks good` closure, artifact-count proof, unsupported evidence kinds, and human acceptance that converts missing/mismatched non-human evidence into `satisfied` proof. Waiver, deferment, proof debt, or narrowed-claim language is acceptable only when the stronger UI claim is not treated as proven. - `closure_honesty`: for UI proof planning, reject weak slots that omit a specific route/state, viewport rationale or narrowed viewport claim limit, minimum observations, expected artifact types, runnable validation, or a way to compare observed proof back to the planned claim. 
Treat under-specified viewport coverage as a blocker for responsive or layout-sensitive claims. `agent-browser` is the default live runtime evidence path; do not block a slot solely for using another project-native browser path, but require the plan to explain the `agent-browser` availability constraint and fallback choice. +- `closure_honesty`: for runtime capture benchmarks, accept optional `runtime_capture_requirements` only when they stay provider-neutral, name required/optional modes, include budgets or an explicit budget rationale, keep `agent-browser` first, and treat direct-CDP as selected DOM/CSS/computed-style escalation. Chrome DevTools MCP and Playwright MCP must be optional only when already configured; reject plans that install browser tooling, browser MCP, CI, Storybook, or visual-regression infrastructure merely to satisfy capture annotations. - `closure_honesty`: for UI proof privacy, require artifact `visibility`, `retention`, `sensitivity`, and `safe_to_publish`, require `gsdd ui-proof validate` or `gsdd health` when bundle metadata exists, and reject public/tracked/delivery/publication proof claims backed by local-only or `safe_to_publish: false` artifacts. - `high_leverage_review`: high-leverage surfaces have a second-pass review or equivalent contradiction/staleness check before completion. - `approach_alignment`: when APPROACH.md is provided, verify that plan tasks implement the chosen approaches from the user's decisions. Check: diff --git a/distilled/templates/ui-proof.md b/distilled/templates/ui-proof.md index 26ca548..a91e329 100644 --- a/distilled/templates/ui-proof.md +++ b/distilled/templates/ui-proof.md @@ -34,6 +34,17 @@ ui_proof_slots: notes: "State why this viewport is enough for the claim, or add separate slots/observations for mobile, desktop, or responsive states." manual_acceptance_required: false claim_limit: "Does not prove cross-browser layout, full accessibility conformance, production delivery, or unrelated UI states." 
+ runtime_capture_requirements: + provider_preference: [agent-browser, direct-cdp] + fallback_policy: "record_availability_and_narrow_claim" + required_modes: [screenshot, interactive_snapshot] + optional_modes: [selected_element_dom, computed_style, console_delta, network_delta, framework_state] + budgets: + text_bytes_max: 24000 + estimated_tokens_max: 6000 + raw_artifact_bytes_max: 5000000 + screenshot_count_max: 4 + computed_style_properties_max: 80 no_ui_proof_rationale: null ``` @@ -44,6 +55,7 @@ Slot rules: - The planner chooses the viewport set, but the slot must explain the choice. Include desktop and mobile proof when the claim covers responsive layout or when the changed surface is likely to behave differently across those sizes; otherwise narrow the claim limit. - Source annotations, AST/cAST findings, semantic search hits, comments, and Semble-like retrieval may help discover proof obligations. They are discovery hints only; they do not satisfy proof slots. - Do not add Playwright, Cypress, Storybook, Cucumber, CI, browser MCP, or visual-regression tooling by default. +- Add optional `runtime_capture_requirements` only when the slot needs benchmarkable browser-provider or capture-cost proof; omit it for ordinary UI proof that can be compared by the base slot fields. - Human approval is required for visual taste, accessibility judgment, baseline acceptance, subjective polish/layout quality, and privacy publication decisions. - Human approval does not replace required non-human evidence when the slot requires `code`, `test`, `runtime`, or `delivery` evidence. @@ -138,6 +150,49 @@ Replace placeholders such as `{work_item_dir}` with the current phase, quick-tas "notes": "Local screenshot only; not public proof unless sanitized and reclassified." 
} ], + "runtime_capture": { + "provider": { + "primary": "agent-browser", + "selected": "agent-browser", + "fallback_chain": ["agent-browser", "direct-cdp", "chrome-devtools-mcp", "playwright-mcp", "manual"], + "fallback_reason": null, + "availability": [ + { "provider": "agent-browser", "status": "available" } + ] + }, + "captures": [ + { + "mode": "screenshot", + "slot_ids": ["ui-01"], + "artifact_refs": ["{work_item_dir}/artifacts/example-1280.png"], + "latency_ms": 420, + "raw_bytes": 184224, + "text_bytes": 0, + "estimated_tokens": 0, + "screenshot_count": 1, + "token_estimate_method": "not_applicable", + "result": "passed" + }, + { + "mode": "interactive_snapshot", + "slot_ids": ["ui-01"], + "latency_ms": 180, + "raw_bytes": 0, + "text_bytes": 2200, + "estimated_tokens": 550, + "token_estimate_method": "rough_char_div_4", + "result": "passed" + } + ], + "fidelity": { + "sees_pixels": true, + "includes_accessibility_tree": true, + "includes_dom_subset": false, + "includes_computed_styles": false, + "includes_framework_state": false, + "claim_limits": ["No selected-element computed style capture was required for this slot."] + } + }, "privacy": { "data_classification": "synthetic", "redactions": [], @@ -176,9 +231,25 @@ Bundle rules: - Quick-mode UI proof should use deterministic synthetic IDs such as `quick-001` and `quick-001-ui-01` when roadmap requirement IDs do not exist. - Classify failed UI proof using existing GSDD gap/proof-debt language: `product_bug`, `missing_infra`, `flaky_harness`, or `ambiguous_spec`. Do not add new result statuses or evidence kinds for those causes. +## Runtime Capture Benchmarks + +Use `runtime_capture_requirements` on planned slots and `runtime_capture` on observed bundles to benchmark browser evidence only when the claim needs provider choice, capture fidelity, or cost to be measurable. These fields are optional and metadata-only; existing proof bundles remain valid without them. 
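Reviewer sketch (not part of the patch): a minimal illustration of how the budget comparison could play out for the template values shown above, assuming a budget-field-to-metric mapping. The real `RUNTIME_CAPTURE_BUDGET_FIELD_MAP` is not visible in this diff, so `ASSUMED_BUDGET_FIELD_MAP`, `budgetTotal`, and `overBudgetFields` are hypothetical names. The sketch also omits the comparator's fallback of counting `screenshot`-mode captures when no explicit `screenshot_count` is recorded.

```javascript
// Assumed mapping from planned budget field to per-capture metric.
// Hypothetical: the actual RUNTIME_CAPTURE_BUDGET_FIELD_MAP is not shown in this diff.
const ASSUMED_BUDGET_FIELD_MAP = {
  text_bytes_max: 'text_bytes',
  estimated_tokens_max: 'estimated_tokens',
  raw_artifact_bytes_max: 'raw_bytes',
  screenshot_count_max: 'screenshot_count',
};

// Sum one metric across all captures, ignoring missing or non-numeric values.
function budgetTotal(captures, budgetField) {
  const metric = ASSUMED_BUDGET_FIELD_MAP[budgetField];
  if (!metric) return 0;
  return captures.reduce(
    (sum, capture) => sum + (typeof capture[metric] === 'number' ? capture[metric] : 0),
    0,
  );
}

// Return the budget fields whose summed metric exceeds the planned maximum.
function overBudgetFields(captures, budgets) {
  return Object.entries(budgets)
    .filter(([field]) => Object.prototype.hasOwnProperty.call(ASSUMED_BUDGET_FIELD_MAP, field))
    .filter(([field, max]) => budgetTotal(captures, field) > max)
    .map(([field]) => field);
}

// Budgets and capture metrics copied from the template example in this diff.
const budgets = {
  text_bytes_max: 24000,
  estimated_tokens_max: 6000,
  raw_artifact_bytes_max: 5000000,
  screenshot_count_max: 4,
};
const captures = [
  { mode: 'screenshot', raw_bytes: 184224, text_bytes: 0, estimated_tokens: 0, screenshot_count: 1 },
  { mode: 'interactive_snapshot', raw_bytes: 0, text_bytes: 2200, estimated_tokens: 550 },
];

console.log(overBudgetFields(captures, budgets)); // []
console.log(overBudgetFields(captures, { ...budgets, text_bytes_max: 2000 })); // [ 'text_bytes_max' ]
```

With the template numbers, the two captures total 2200 text bytes, 550 estimated tokens, 184224 raw bytes, and one screenshot, so no budget is exceeded. Tightening `text_bytes_max` to 2000 is the kind of case that would surface as `runtime_capture_budget_exceeded` in the comparator.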
+ +Provider chain: +- Default live UI proof remains `agent-browser`. +- Use direct-CDP only as an explicit escalation for selected-element DOM, CSS, computed-style, console, network, or framework-state claims that the default path cannot prove cleanly. +- Chrome DevTools MCP and Playwright MCP are optional only when already configured, scoped to the claim, and recorded as fallback metadata. +- Do not add browser tooling, browser installs, CI, Storybook, browser MCP, or visual-regression infrastructure just to fill these fields. + +Stable capture modes are `screenshot`, `interactive_snapshot`, `accessibility_snapshot`, `dom_subset`, `selected_element_dom`, `computed_style`, `console_delta`, `network_delta`, `framework_state`, and `manual_observation`. Provider availability statuses are `available`, `unavailable`, `not_configured`, `skipped`, and `failed`. + +Budget fields are `text_bytes_max`, `estimated_tokens_max`, `raw_artifact_bytes_max`, `screenshot_count_max`, `computed_style_properties_max`, `console_event_count_max`, and `network_event_count_max`. The comparator enforces them only when a planned slot declares them. + +Keep raw screenshots, traces, videos, DOM, reports, console/network logs, and framework-state captures referenced as local artifacts or summarized metadata; do not inline raw sensitive state. Claims that a research, deepening, or document-review pass used a pinned model such as `gpt-5.4-high` need runtime model-routing evidence before the proof bundle can claim that review actually ran. + ## Deterministic Validation -Use `gsdd ui-proof validate ` on JSON proof-bundle metadata or markdown fenced JSON before relying on a bundle for closure; add `--claim ` only when validating that stronger proof use. Use `gsdd ui-proof compare [observed-bundle-json ...]` when verifying planned proof slots against observed bundles through the deterministic product-facing path. 
Required planned-slot fields are `slot_id`, `claim`, `route_state`, `required_evidence_kinds`, `minimum_observations`, `expected_artifact_types`, `validation_command`, `environment`, `viewport`, `manual_acceptance_required`, and `claim_limit`. Required observed-bundle top-level fields are `proof_bundle_version`, `scope`, `route_state`, `environment`, `viewport`, `evidence_inputs`, `commands_or_manual_steps`, `observations`, `artifacts`, `privacy`, `result`, and `claim_limits`. The validator checks planned-slot specificity, required bundle and observation fields, structured command/manual-step entries, fixed evidence kinds, concise `tools_used` IDs, `result.claim_status`, observation `result`, comparison statuses, failure classification for failed/partial proof, non-empty claim limits, locked artifact and observation privacy fields, observation-to-artifact references, workspace-relative/http(s) artifact references, existing local artifact paths when validating from files, and explicit public/tracked/delivery proof claims that rely on local-only, unsafe, unsanitized, or privacy-contradictory artifacts. `claim_status`, observation `result`, and command/manual-step `result` use `passed`, `failed`, `partial`, `waived`, `deferred`, or `not_applicable`; failed/partial proof uses `product_bug`, `missing_infra`, `flaky_harness`, or `ambiguous_spec`. It does not inspect raw screenshot, trace, video, DOM, or report contents and does not require any specific browser provider such as `agent-browser`. +Use `gsdd ui-proof validate ` on JSON proof-bundle metadata or markdown fenced JSON before relying on a bundle for closure; add `--claim ` only when validating that stronger proof use. Use `gsdd ui-proof compare [observed-bundle-json ...]` when verifying planned proof slots against observed bundles through the deterministic product-facing path. 
Required planned-slot fields are `slot_id`, `claim`, `route_state`, `required_evidence_kinds`, `minimum_observations`, `expected_artifact_types`, `validation_command`, `environment`, `viewport`, `manual_acceptance_required`, and `claim_limit`. Required observed-bundle top-level fields are `proof_bundle_version`, `scope`, `route_state`, `environment`, `viewport`, `evidence_inputs`, `commands_or_manual_steps`, `observations`, `artifacts`, `privacy`, `result`, and `claim_limits`. Optional `runtime_capture_requirements` and `runtime_capture` metadata is validated when present and compared only for slots that opt in. The validator checks planned-slot specificity, runtime capture modes, provider availability statuses, budget metric fields, required bundle and observation fields, structured command/manual-step entries, fixed evidence kinds, concise `tools_used` IDs, `result.claim_status`, observation `result`, comparison statuses, failure classification for failed/partial proof, non-empty claim limits, locked artifact and observation privacy fields, observation-to-artifact references, workspace-relative/http(s) artifact references, existing local artifact paths when validating from files, and explicit public/tracked/delivery proof claims that rely on local-only, unsafe, unsanitized, or privacy-contradictory artifacts. `claim_status`, observation `result`, runtime capture `result`, and command/manual-step `result` use `passed`, `failed`, `partial`, `waived`, `deferred`, or `not_applicable`; failed/partial proof uses `product_bug`, `missing_infra`, `flaky_harness`, or `ambiguous_spec`. It does not inspect raw screenshot, trace, video, DOM, or report contents and does not require any specific browser provider such as `agent-browser`. 
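The opt-in comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the comparator in `bin/lib/ui-proof.mjs`: the function name `checkRuntimeCapture`, the mapping from budget fields to capture metrics, and the choice to sum metrics across all captures linked to a slot are all assumptions made for the example.

```javascript
// Hypothetical sketch; the real comparator may differ.
// Assumption: each budget applies to the total across captures linked to the slot.
const BUDGET_TO_METRIC = {
  text_bytes_max: "text_bytes",
  estimated_tokens_max: "estimated_tokens",
  raw_artifact_bytes_max: "raw_bytes", // assumed mapping
  screenshot_count_max: "screenshot_count",
};

function checkRuntimeCapture(slotId, requirements, runtimeCapture) {
  const issues = [];
  const captures = (runtimeCapture?.captures ?? []).filter((c) =>
    (c.slot_ids ?? []).includes(slotId)
  );

  // Every required mode needs at least one passed capture linked to the slot.
  for (const mode of requirements.required_modes ?? []) {
    const ok = captures.some((c) => c.mode === mode && c.result === "passed");
    if (!ok) issues.push(`missing required mode: ${mode}`);
  }

  // Budgets are enforced only when the planned slot declares them.
  for (const [budgetKey, metric] of Object.entries(BUDGET_TO_METRIC)) {
    const limit = requirements.budgets?.[budgetKey];
    if (limit === undefined) continue;
    const total = captures.reduce((sum, c) => sum + (c[metric] ?? 0), 0);
    if (total > limit) issues.push(`${budgetKey} exceeded: ${total} > ${limit}`);
  }
  return issues;
}

// Values taken from the planned-slot and observed-bundle examples in this template.
const requirements = {
  required_modes: ["screenshot", "interactive_snapshot"],
  budgets: { text_bytes_max: 24000, estimated_tokens_max: 6000,
             raw_artifact_bytes_max: 5000000, screenshot_count_max: 4 },
};
const runtimeCapture = {
  captures: [
    { mode: "screenshot", slot_ids: ["ui-01"], raw_bytes: 184224, text_bytes: 0,
      estimated_tokens: 0, screenshot_count: 1, result: "passed" },
    { mode: "interactive_snapshot", slot_ids: ["ui-01"], raw_bytes: 0, text_bytes: 2200,
      estimated_tokens: 550, result: "passed" },
  ],
};
console.log(checkRuntimeCapture("ui-01", requirements, runtimeCapture)); // []
```

A slot that omits `runtime_capture_requirements` would never reach this check, which is how existing bundles stay valid without the new metadata.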
## Comparison Statuses diff --git a/distilled/workflows/execute.md b/distilled/workflows/execute.md index 7727a16..cd0639c 100644 --- a/distilled/workflows/execute.md +++ b/distilled/workflows/execute.md @@ -166,7 +166,7 @@ Before reporting a task complete: ### UI Proof Execution If the plan defines non-empty `ui_proof_slots`, create or update the observed UI proof bundle before claiming completion; required top-level fields are `proof_bundle_version`, `scope`, `route_state`, `environment`, `viewport`, `evidence_inputs`, `commands_or_manual_steps`, `observations`, `artifacts`, `privacy`, `result`, and `claim_limits`. -Use `agent-browser` as the default live UI proof path. Record the planned route/state open, interactive snapshots/refs when interaction is part of the claim, changed-flow interaction, screenshots for planned viewport(s), and relevant console/network observations. If `agent-browser` is unavailable, record the availability constraint and the closest project-native interactive browser fallback in the proof bundle instead of silently treating the fallback as the default path. If the repo already has Playwright tests or a package script wrapping them, run the relevant targeted test as canonical repeatable regression evidence; keep `agent-browser` as complementary runtime proof. Use Playwright scripting only for checks `agent-browser` cannot cover cleanly, such as JS-disabled, structured console, or multi-context verification. Do not install Playwright, Cypress, Cucumber, Storybook, browser MCP, CI, or visual-regression tooling by default. Screenshots, traces, videos, reports, accessibility scans, Gherkin, visual diffs, and manual notes map onto existing evidence kinds, not new evidence kinds; reference raw artifacts by path/link instead of storing them inline. +Use `agent-browser` as the default live UI proof path. 
Record the planned route/state open, interactive snapshots/refs when interaction is part of the claim, changed-flow interaction, screenshots for planned viewport(s), and relevant console/network observations. If `agent-browser` is unavailable, record the availability constraint and the closest project-native interactive browser fallback in the proof bundle instead of silently treating the fallback as the default path. If the repo already has Playwright tests or a package script wrapping them, run the relevant targeted test as canonical repeatable regression evidence; keep `agent-browser` as complementary runtime proof. Use Playwright scripting only for checks `agent-browser` cannot cover cleanly, such as JS-disabled, structured console, or multi-context verification. Do not install Playwright, Cypress, Cucumber, Storybook, browser MCP, CI, or visual-regression tooling by default. Screenshots, traces, videos, reports, accessibility scans, Gherkin, visual diffs, and manual notes map onto existing evidence kinds, not new evidence kinds; reference raw artifacts by path/link instead of storing them inline. If a planned slot declares `runtime_capture_requirements`, record optional `runtime_capture` metadata with the selected provider, fallback chain, availability, capture modes, latency, text/raw byte counts, estimated tokens, screenshot counts, bounded computed-style/property counts, fidelity limits, and artifact refs. Direct-CDP is an escalation path for selected DOM/CSS/computed-style, console, network, or framework-state proof; Chrome DevTools MCP and Playwright MCP are optional only when already configured and scoped. Keep raw screenshots, traces, videos, DOM, reports, console/network logs, and framework state local-only or summarized unless explicitly sanitized. 
Each artifact entry must include `visibility`, `retention`, `sensitivity`, and `safe_to_publish`; raw screenshots, traces, videos, DOM snapshots, and reports default to `local_only` and `safe_to_publish: false` unless explicitly sanitized. Use `gsdd ui-proof validate ` when bundle metadata exists, adding `--claim <...>` only when relying on the bundle for public, tracked, delivery, release, or publication proof. Visual taste, accessibility judgment, baseline acceptance, subjective polish/layout quality, and privacy publication decisions require human evidence or explicit waiver; artifact count, source comments, AST/cAST findings, semantic search, and Semble-like retrieval are not proof. If evidence does not match the slot claim, route/state, observation, artifact path/manual step, privacy metadata, result, and claim limit, record proof debt, waiver, deferment, or reduced claim language rather than `satisfied` proof. Classify failed UI proof using existing gap/proof-debt language: `product_bug`, `missing_infra`, `flaky_harness`, or `ambiguous_spec`. Do not add new evidence kinds or result statuses for those causes. diff --git a/distilled/workflows/plan.md b/distilled/workflows/plan.md index f921de5..fb06533 100644 --- a/distilled/workflows/plan.md +++ b/distilled/workflows/plan.md @@ -138,7 +138,7 @@ If any of these are missing or contradictory, STOP. Report the exact missing con For UI-sensitive work, include compact `ui_proof_slots` with `slot_id`, optional `requirement_id`, `claim`, `route_state`, fixed evidence kinds (`code`, `test`, `runtime`, `delivery`, `human`), `minimum_observations`, `expected_artifact_types`, `validation_command`, `environment`, `viewport`, `manual_acceptance_required`, and `claim_limit`; otherwise set `no_ui_proof_rationale`. Do not create slots for backend-only, CLI-only, docs-only, or refactor-only work unless the plan claims a visible UI outcome. 
Evidence must later match claim, route/state, observation, artifact path, evidence kind, privacy metadata, result, and claim limit; local-only or unsafe artifacts cannot support public, publication, tracked, delivery, or release proof claims. Human approval does not replace required `code`, `test`, `runtime`, or `delivery` evidence. -For live rendered UI proof, plan `agent-browser` as the default runtime evidence path and existing Playwright/package-script browser tests as the repeatable regression path when the repo already has them. If the runtime does not provide `agent-browser`, require the plan to state that availability constraint and name the closest project-native interactive browser fallback before narrowing the claim. The planner chooses the viewport set, but each slot must explain why the chosen viewport(s) are enough for the claim or narrow the claim limit; responsive claims need desktop/mobile or equivalent state coverage. Do not plan new browser infrastructure by default, and use Playwright scripting only for checks `agent-browser` cannot cover cleanly, such as JS-disabled, structured console, or multi-context verification. +For live rendered UI proof, plan `agent-browser` as the default runtime evidence path and existing Playwright/package-script browser tests as the repeatable regression path when the repo already has them. If the runtime does not provide `agent-browser`, require the plan to state that availability constraint and name the closest project-native interactive browser fallback before narrowing the claim. The planner chooses the viewport set, but each slot must explain why the chosen viewport(s) are enough for the claim or narrow the claim limit; responsive claims need desktop/mobile or equivalent state coverage. Do not plan new browser infrastructure by default, and use Playwright scripting only for checks `agent-browser` cannot cover cleanly, such as JS-disabled, structured console, or multi-context verification. 
When provider choice, capture fidelity, or capture cost needs to be benchmarked, add optional `runtime_capture_requirements` to the slot with provider preference, fallback policy, required/optional modes, and budgets. Keep `agent-browser` first; direct-CDP is only an escalation for selected-element DOM/CSS/computed-style, console, network, or framework-state claims; Chrome DevTools MCP and Playwright MCP are optional only when already configured. Do not plan browser installs, browser MCP, CI, Storybook, or visual-regression infrastructure just to satisfy runtime capture annotations. Plan backward from success criteria. diff --git a/distilled/workflows/quick.md b/distilled/workflows/quick.md index 37a5b96..9d5f76c 100644 --- a/distilled/workflows/quick.md +++ b/distilled/workflows/quick.md @@ -122,6 +122,7 @@ Delegate to the planner role in quick mode. - UI proof slots must be matchable to exact observed evidence later: claim, route/state, observation, evidence kind, artifact path or manual step, privacy metadata, result, and claim limit. Discovery hints from source comments, AST/cAST, semantic search, or Semble-like retrieval do not satisfy proof. - Observed artifact metadata must include `visibility`, `retention`, `sensitivity`, and `safe_to_publish`; raw screenshots, traces, videos, DOM snapshots, and reports are local-only/unsafe by default. Use `gsdd ui-proof validate ` or `gsdd health` when a bundle exists; add `--claim <...>` only for public, publication, tracked, delivery, or release proof use. - For live rendered UI proof, default to `agent-browser` snapshots/refs, interactions, screenshots, and relevant console/network observations. If unavailable, state the availability constraint and closest project-native interactive browser fallback before narrowing the claim. Existing Playwright/package-script browser tests remain the canonical repeatable regression path when present. 
The viewport set is plan-owned, but under-specified viewport coverage is weak proof; explain the chosen viewport(s) or narrow the claim limit. +- Add optional `runtime_capture_requirements` only when provider choice, capture fidelity, or capture cost must be benchmarked; keep `agent-browser` first, direct-CDP as selected DOM/CSS/computed-style escalation, and Chrome DevTools MCP or Playwright MCP optional only when already configured. - Keep UI proof proportional: do not scaffold Playwright, Cypress, Cucumber, Storybook, CI, browser MCP, or visual-regression tooling by default - Ignore Step 1 requirement extraction; use inline goal-backward planning only - Target minimal context usage @@ -271,6 +272,7 @@ Delegate to the executor role. - Create summary at: `.planning/quick/$NEXT_NUM-$SLUG/$NEXT_NUM-SUMMARY.md` - If the quick plan defines `ui_proof_slots`, create or update `.planning/quick/$NEXT_NUM-$SLUG/UI-PROOF.md` with fenced JSON containing required top-level fields: `proof_bundle_version`, `scope`, `route_state`, `environment`, `viewport`, `evidence_inputs`, `commands_or_manual_steps`, `observations`, `artifacts`, `privacy`, `result`, and `claim_limits` - For live UI proof, record `agent-browser` in `evidence_inputs.tools_used` when used, the exact commands or manual ref-based steps, screenshot/report artifact paths, and any relevant console/network observations. If `agent-browser` was unavailable, record that availability constraint and fallback tool explicitly. If existing Playwright tests supplied regression evidence, record the package command and result separately from the `agent-browser` runtime observation. +- If the slot declared `runtime_capture_requirements`, add `runtime_capture` metadata with provider availability, capture modes, budgets/costs, fidelity limits, and artifact refs; missing modes, unexplained fallback, or over-budget captures keep the quick UI proof partial. 
- Human approval for visual taste, accessibility judgment, baseline acceptance, subjective polish/layout quality, or privacy publication does not replace required `code`, `test`, `runtime`, or `delivery` evidence **Output:** `.planning/quick/$NEXT_NUM-$SLUG/$NEXT_NUM-SUMMARY.md` diff --git a/distilled/workflows/verify.md b/distilled/workflows/verify.md index 615a654..9035f91 100644 --- a/distilled/workflows/verify.md +++ b/distilled/workflows/verify.md @@ -131,8 +131,8 @@ Note: this step does NOT replace levels 1–3. An artifact can satisfy the evide Before closure, direct `gsdd verify ` and this workflow must fail closed when the target phase has no matching PLAN.md or SUMMARY.md; report structured prerequisite blockers instead of treating missing artifacts as an empty success. Read UI proof declaration authority from the plan frontmatter contract only: body prose, fenced examples, stale sidecars, and markdown snippets do not declare UI proof intent. If frontmatter defines non-empty `ui_proof_slots`, compare planned UI proof against observed bundles before closure. Prefer `gsdd ui-proof compare [observed-bundle-json ...]` when planned slots are available as JSON or fenced JSON; otherwise perform the same field-by-field comparison and record reduced assurance if no deterministic command could run. If frontmatter records `ui_proof_slots: []`, it must also contain a nonblank `no_ui_proof_rationale`; otherwise verification blocks. If the plan records only `no_ui_proof_rationale`, verify the rationale instead of requiring a bundle, and treat stale planned/observed sidecars as warnings rather than proof or blockers. Each observed bundle must include top-level `proof_bundle_version`, `scope`, `route_state`, `environment`, `viewport`, `evidence_inputs`, `commands_or_manual_steps`, `observations`, `artifacts`, `privacy`, `result`, and `claim_limits`. -Classify each slot as exactly one of: `satisfied`, `partial`, `missing`, `waived`, `deferred`, or `not_applicable`. 
Deterministic comparison issues include `severity` and `fix_hint`; use those as the normal repair feedback loop before closing verification. Waiver/deferment narrows the claim; it is not proof. Screenshots, traces, videos, reports, accessibility scans, Gherkin, visual diffs, and manual notes are artifact types or activities mapped onto existing evidence kinds, not new evidence kinds. Artifact count is never proof; each artifact must tie to the slot claim, route/state, observation, artifact path/link, privacy metadata, and claim limit. -For live UI runtime proof, expect `agent-browser` as the default captured tool unless the observed bundle explains a project-native equivalent or an availability constraint. Do not fail solely because another browser tool was used, but downgrade vague proof that lacks exact route/state, planned viewport coverage or rationale, interactive steps/refs where relevant, screenshot/report artifacts, or relevant console/network observations. Existing Playwright tests count as canonical repeatable regression evidence, not a replacement for scoped runtime evidence when the slot requires `runtime`. +Classify each slot as exactly one of: `satisfied`, `partial`, `missing`, `waived`, `deferred`, or `not_applicable`. Deterministic comparison issues include `severity` and `fix_hint`; use those as the normal repair feedback loop before closing verification. Waiver/deferment narrows the claim; it is not proof. Screenshots, traces, videos, reports, accessibility scans, Gherkin, visual diffs, and manual notes are artifact types or activities mapped onto existing evidence kinds, not new evidence kinds. Artifact count is never proof; each artifact must tie to the slot claim, route/state, observation, artifact path/link, privacy metadata, and claim limit. 
If a planned slot includes `runtime_capture_requirements`, verify that observed `runtime_capture` metadata is linked to the slot, includes every required passed capture mode, stays within declared budgets, and records selected provider plus fallback rationale when the selected provider differs from preference. Missing runtime capture metadata, missing modes, over-budget captures, or unexplained fallback keep the slot `partial` or `missing`. +For live UI runtime proof, expect `agent-browser` as the default captured tool unless the observed bundle explains a project-native equivalent or an availability constraint. Do not fail solely because another browser tool was used, but downgrade vague proof that lacks exact route/state, planned viewport coverage or rationale, interactive steps/refs where relevant, screenshot/report artifacts, or relevant console/network observations. Existing Playwright tests count as canonical repeatable regression evidence, not a replacement for scoped runtime evidence when the slot requires `runtime`. Treat direct-CDP as an escalation for selected DOM/CSS/computed-style, console, network, or framework-state claims, not as the default live proof path. Chrome DevTools MCP and Playwright MCP count only when already configured, scoped to the claim, and recorded in the fallback chain; verification must not require installing browser tooling, CI, Storybook, or visual-regression infrastructure. Artifact privacy metadata must include `visibility`, `retention`, `sensitivity`, and `safe_to_publish`; raw screenshots, traces, videos, DOM snapshots, and reports default to local-only and unsafe unless sanitized. Run `gsdd ui-proof validate ` or treat `gsdd health` E10 as blocking; add `--claim <...>` when relying on the bundle for public, tracked, delivery, release, or publication proof. 
Visual taste, accessibility judgment, baseline acceptance, subjective polish/layout quality, and privacy publication require human evidence or explicit waiver; human approval does not replace required `code`, `test`, `runtime`, or `delivery` evidence. Source annotations, AST/cAST findings, semantic search, comments, and Semble-like retrieval are discovery hints only. diff --git a/docs/USER-GUIDE.md b/docs/USER-GUIDE.md index 45c14e8..8c1b26b 100644 --- a/docs/USER-GUIDE.md +++ b/docs/USER-GUIDE.md @@ -203,6 +203,8 @@ The 7 check dimensions: requirement coverage, task completeness, dependency corr | `npx -y gsdd-cli closeout-report [--json] [--phase ]` | Read-only closeout replay: blockers, warnings, fixes, and next safe action (composed from control-map, health/preflight, verify, and UI-proof signals) | | `npx -y gsdd-cli find-phase [N]` | Show phase info as JSON (for agent consumption) | | `npx -y gsdd-cli verify ` | Run artifact checks for phase N | +| `npx -y gsdd-cli ui-proof validate ` | Validate UI proof metadata, including optional browser runtime capture annotations | +| `npx -y gsdd-cli ui-proof compare [observed-bundle-json ...]` | Compare planned UI proof slots and runtime capture requirements against observed bundles | | `npx -y gsdd-cli scaffold phase [name]` | Create a new phase plan file | | `npx -y gsdd-cli models show` | Display effective model state across all runtimes | | `npx -y gsdd-cli models profile ` | Set global model profile (`quality`/`balanced`/`budget`) | @@ -213,6 +215,8 @@ The 7 check dimensions: requirement coverage, task completeness, dependency corr If `gsdd-cli` is globally installed, you can use the shorter `gsdd ...` form for the same commands. Generated workflow helper calls do not use the global binary; they run through `node .planning/bin/gsdd.mjs ...` from the repo root. +Runtime capture annotations are optional UI proof metadata for benchmarking provider choice, screenshot/snapshot modes, budgets, and fallback reasons. 
They do not install or require browser tooling by default; `agent-browser` remains the default live UI proof path, direct-CDP is an escalation for selected DOM/CSS/computed-style proof, and Chrome DevTools MCP or Playwright MCP should be recorded only when already configured. + Normal user flow: 1. Run `npx gsdd-cli init`. @@ -239,6 +243,8 @@ Other CLI commands that remain available outside the first-run path: |---------|---------| | `gsdd find-phase [N]` | Show phase info as JSON (for agent consumption) | | `gsdd verify ` | Run phase artifact and UI-proof closure checks for phase N; exits nonzero when verification is blocked | +| `gsdd ui-proof validate ` | Validate UI proof metadata, including optional browser runtime capture annotations | +| `gsdd ui-proof compare [observed-bundle-json ...]` | Compare planned UI proof slots and runtime capture requirements against observed bundles | | `gsdd scaffold phase [name]` | Create a new phase plan file | ### Platform flags for `--tools` diff --git a/docs/plans/2026-06-08-001-feat-browser-proof-benchmark-plan.md b/docs/plans/2026-06-08-001-feat-browser-proof-benchmark-plan.md new file mode 100644 index 0000000..24dca6a --- /dev/null +++ b/docs/plans/2026-06-08-001-feat-browser-proof-benchmark-plan.md @@ -0,0 +1,484 @@ +--- +title: Browser Proof Benchmark Annotations +type: feat +status: active +date: 2026-06-08 +origin: local goal handoff +--- + +# Browser Proof Benchmark Annotations + +## Overview + +Add benchmarkable browser-runtime capture annotations to Workspine's existing UI proof contract. The first implementation should not build a managed browser sidecar or a framework-specific Angular inspector. It should extend the provider-neutral proof metadata so execution and verification can record what browser evidence was collected, which provider path was used, what it cost, and what it could not prove. 
+ +This keeps the first slice aligned with the current Workspine architecture: `agent-browser` remains the default live UI proof path, provider-specific tools remain optional, raw artifacts stay local-only by default, and deterministic validation continues to inspect metadata instead of raw screenshots, traces, DOM dumps, or reports. + +## Problem Frame + +Agents can overclaim frontend completion when they only inspect source code, static tests, or screenshots without structured claim linkage. Workspine already has UI proof slots and observed proof bundles, but the current metadata does not make the browser-provider thesis measurable in-flight. In practice, snapshots can be heavy, screenshots may be necessary for visual truth, Chrome/CDP paths can be powerful but expensive or privacy-sensitive, and MCP wrappers should not become default infrastructure by accident. + +This plan makes each provider path and capture mode comparable while preserving the current proof boundary: browser artifacts support existing evidence kinds; they do not create new evidence kinds. + +## Requirements Trace + +- R1. Preserve the existing five evidence kinds: `code`, `test`, `runtime`, `delivery`, and `human`. +- R2. Keep `agent-browser` as the default live rendered UI proof path in documentation and agent guidance. +- R3. Add optional metadata for provider selection, fallback path, capture modes, budgets, latency, text size, token estimate, artifact size, fidelity, and privacy posture. +- R4. Do not require Chrome DevTools MCP, Playwright MCP, new browser installs, CI, Storybook, or visual-regression infrastructure. +- R5. Validate new browser benchmark annotations deterministically when present, while keeping existing proof bundles valid. +- R6. Compare planned capture requirements to observed capture metadata when a planned UI proof slot opts into runtime capture requirements. +- R7. 
Keep raw screenshots, traces, videos, DOM snapshots, and reports local-only and unsafe to publish unless explicitly sanitized. +- R8. Document the provider chain: `agent-browser` primary, direct CDP attach as escalation, Chrome DevTools MCP and Playwright MCP only when already configured, browser launch only with explicit opt-in. +- R9. Record the `gpt-5.4-high` research/deepening model requirement as a planning constraint; do not claim model-pinned subagent research ran unless runtime routing can prove it. + +## Scope Boundaries + +- Do not implement live direct-CDP capture in this first slice. +- Do not add a new browser automation dependency. +- Do not introduce a new evidence kind such as `visual` or `browser`. +- Do not make Angular a required dependency or special-case validator path. +- Do not treat screenshots, snapshots, traces, DOM dumps, or framework state as verdicts by themselves. +- Do not change release, delivery, or public-proof privacy requirements. + +### Deferred to Separate Tasks + +- Live direct-CDP capture provider: separate feature after the proof contract can benchmark providers. +- Angular runtime adapter: separate feature after generic selected-element and framework-state metadata exists. +- Aggregated local metrics history across proof bundles: separate feature if per-bundle annotations prove useful. +- General model routing for arbitrary research and document-review subagents: separate model-orchestration feature. This plan only records the constraint and avoids model-unpinnable subagent claims. + +## Context & Research + +### Relevant Code and Patterns + +- `distilled/templates/ui-proof.md` defines planned UI proof slots, observed proof bundles, privacy defaults, and deterministic validation guidance. +- `bin/lib/ui-proof.mjs` validates UI proof metadata and compares planned slots to observed bundles without inspecting raw artifact contents. 
+- `distilled/workflows/plan.md`, `distilled/workflows/execute.md`, `distilled/workflows/verify.md`, and `distilled/workflows/quick.md` already describe `agent-browser` as the default runtime proof path and Playwright tests as repeatable regression evidence. +- `agents/planner.md`, `agents/executor.md`, and `agents/verifier.md` mirror those workflow contracts for installed agent surfaces. +- `tests/phase.test.cjs` contains UI proof validation and compare behavior tests. +- `tests/gsdd.guards.test.cjs` contains locked guard tests for the UI proof contract, including provider-agnostic validation and the `agent-browser` default. +- `bin/lib/models.mjs` only exposes portable agent model config for `plan-checker` and `approach-explorer`; it does not currently provide general model-pinned research subagent routing. +- Local goal handoff captured sibling-repo browser-intent and ideaspine context used as origin material for this plan. + +### Institutional Learnings + +- Keep UI proof scoped to a claim, route/state, viewport, evidence kind, artifact link, privacy metadata, result, and claim limit. +- Artifact count is not proof. +- Raw UI artifacts default to local-only and unsafe. +- A fallback browser tool can support a narrowed local runtime claim, but it must not pretend the default path ran. +- Browser snapshots can become token-heavy; proof collection must be targeted and budgeted. +- Queryable targeted state beats dumping full DOM or component trees. +- Screenshots are evidence, not a verdict. +- Human acceptance is required for subjective visual taste, baseline acceptance, and privacy publication decisions, but it does not replace missing non-human evidence. + +### External References + +- Existing Workspine source references are tracked in `distilled/EVIDENCE-INDEX.md`. +- This plan relies on the repo's recorded browser proof references there rather than introducing a new external dependency. 
+ +## Key Technical Decisions + +- Add optional `runtime_capture` metadata to observed proof bundles: This keeps old bundles valid while allowing new bundles to record provider, cost, fidelity, and capture-mode data. +- Add optional `runtime_capture_requirements` to planned UI proof slots: Planned slots opt into benchmark enforcement only when the proof claim needs it. +- Validate shape separately from proof sufficiency: `gsdd ui-proof validate` should reject malformed benchmark annotations; `gsdd ui-proof compare` should decide whether observed capture satisfies planned capture requirements. +- Keep provider IDs open but syntactically constrained: Use the existing concise tool-ID pattern instead of hard-coding a provider enum into validation. +- Keep capture modes enumerated: Capture modes need stable names so costs and fidelity can be compared across providers. +- Treat budget overruns as comparison failures only when a plan declares budgets: Existing proof bundles should not become invalid just because they lack budget metadata. +- Keep screenshots as actual artifact paths: The proof bundle should reference screenshot files and metadata, not inline binary data or a screenshot transcript. +- Direct CDP is an escalation strategy, not the default: It is powerful for DOM/CSS/computed-style evidence, but the first Workspine feature should measure provider paths before owning browser lifecycle. + +## Open Questions + +### Resolved During Planning + +- Should Playwright MCP be the first-class default? No. Prior local context warned that snapshot cost can exceed practical phase-exit budgets, and current Workspine docs already default to `agent-browser`. +- Should Chrome DevTools MCP be the default? No. It is useful for deep debugging, but it has profile, privacy, and provider-lock risks and should be optional when already configured. +- Should V1 build Angular inspection? No. V1 should admit optional framework-state evidence without requiring Angular. 
+- Should the validator inspect raw screenshot pixels or DOM contents? No. Workspine's validator remains metadata-focused and provider-neutral. + +### Deferred to Implementation + +- Exact default numeric budgets: Implementation should start with conservative documented defaults and adjust only with tests and sample bundles. +- Exact token-estimation method: The first implementation can record `estimated_tokens` plus `token_estimate_method`; exact tokenizer parity is not required for metadata comparison. +- Exact capture-mode names after implementation touch: The list below is directional and should be finalized in constants when editing `bin/lib/ui-proof.mjs`. +- Whether comparison returns warnings or partial status for non-budget fidelity gaps: Implementation should follow existing `compareUiProofSlots` style and avoid weakening current blockers. + +## High-Level Technical Design + +> This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce. 
+ +Add optional planned-slot metadata: + +```json +{ + "runtime_capture_requirements": { + "provider_preference": ["agent-browser", "direct-cdp"], + "fallback_policy": "record_availability_and_narrow_claim", + "required_modes": ["screenshot", "interactive_snapshot"], + "optional_modes": ["selected_element_dom", "computed_style", "console_delta", "network_delta", "framework_state"], + "budgets": { + "text_bytes_max": 24000, + "estimated_tokens_max": 6000, + "raw_artifact_bytes_max": 5000000, + "screenshot_count_max": 4, + "computed_style_properties_max": 80, + "console_event_count_max": 50, + "network_event_count_max": 50 + } + } +} +``` + +Add optional observed-bundle metadata: + +```json +{ + "runtime_capture": { + "provider": { + "primary": "agent-browser", + "selected": "agent-browser", + "fallback_chain": ["agent-browser", "direct-cdp", "chrome-devtools-mcp", "playwright-mcp", "manual"], + "fallback_reason": null, + "availability": [ + { "provider": "agent-browser", "status": "available" } + ] + }, + "captures": [ + { + "mode": "screenshot", + "slot_ids": ["ui-01"], + "artifact_refs": ["{work_item_dir}/artifacts/example-1280.png"], + "latency_ms": 420, + "raw_bytes": 184224, + "text_bytes": 0, + "estimated_tokens": 0, + "token_estimate_method": "not_applicable", + "result": "passed" + } + ], + "fidelity": { + "sees_pixels": true, + "includes_accessibility_tree": true, + "includes_dom_subset": false, + "includes_computed_styles": false, + "includes_framework_state": false, + "claim_limits": ["No selected-element computed style capture was required for this slot."] + } + } +} +``` + +Provider chain to document and benchmark: + +| Provider path | Default role | Screenshot support | DOM/CSS depth | Main cost/risk | V1 treatment | +| --- | --- | --- | --- | --- | --- | +| `agent-browser` | Primary live UI proof path | Yes | Snapshot/refs, tool-dependent | Availability and snapshot size | Default in docs and examples | +| `direct-cdp` | Escalation | Yes | Deep DOM, 
CSS, runtime, network, logs | Approved browser profile and implementation complexity | Metadata-supported, live provider deferred | +| `chrome-devtools-mcp` | Optional configured deep-debug path | Yes | Deep DevTools surface | Profile/privacy/tool lock | Record only when already configured | +| `playwright-mcp` | Optional configured snapshot path | Yes | Accessibility snapshot oriented | Snapshot token cost and setup lock | Record only when already configured | +| `manual` | Last fallback or subjective judgment | Human-dependent | Human-dependent | Low automation assurance | Can waive, defer, or narrow claims only | + +Initial capture-mode vocabulary: + +| Mode | Meaning | +| --- | --- | +| `screenshot` | Pixel artifact for the exact route/state/viewport. | +| `interactive_snapshot` | Tool snapshot/refs used to identify and interact with rendered elements. | +| `accessibility_snapshot` | Accessibility-tree or role/name/state structure. | +| `dom_subset` | Scoped DOM structure, not a full page dump. | +| `selected_element_dom` | DOM and attributes for a targeted element or component root. | +| `computed_style` | Bounded computed-style declaration set for selected elements. | +| `console_delta` | Scoped console events observed during the proof window. | +| `network_delta` | Scoped network events observed during the proof window. | +| `framework_state` | Optional framework adapter state, such as Angular ownership or public bound state. | +| `manual_observation` | Human-recorded observation for subjective or fallback claims. | + +Initial provider availability statuses: `available`, `unavailable`, `not_configured`, `skipped`, and `failed`. + +## Implementation Units + +- [ ] **Unit 1: Runtime Capture Metadata Validation** + +**Goal:** Add optional validation for `runtime_capture` on observed bundles and `runtime_capture_requirements` on planned slots without changing existing required fields. 
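A minimal sketch of how the capture-mode vocabulary and provider availability statuses listed above might be enforced as shape checks. It is illustrative only: the constant names, the `validateRuntimeCapture` helper, the error codes, and the provider-ID regex are hypothetical stand-ins. The plan defers to the existing tool-ID pattern and validator style in `bin/lib/ui-proof.mjs`, neither of which is quoted here.

```javascript
// Hypothetical constants mirroring the vocabulary tables in this plan.
const CAPTURE_MODES = new Set([
  'screenshot', 'interactive_snapshot', 'accessibility_snapshot', 'dom_subset',
  'selected_element_dom', 'computed_style', 'console_delta', 'network_delta',
  'framework_state', 'manual_observation',
]);
const AVAILABILITY_STATUSES = new Set([
  'available', 'unavailable', 'not_configured', 'skipped', 'failed',
]);
const METRIC_FIELDS = ['latency_ms', 'raw_bytes', 'text_bytes', 'estimated_tokens'];
// Assumption: a stand-in for the existing concise tool-ID pattern, which this plan does not quote.
const PROVIDER_ID = /^[a-z0-9][a-z0-9._-]*$/;

function validateRuntimeCapture(bundle) {
  const errors = [];
  const capture = bundle.runtime_capture;
  if (capture === undefined) return errors; // optional field: old bundles stay valid

  const provider = capture.provider ?? {};
  for (const key of ['primary', 'selected']) {
    const id = provider[key];
    if (id !== undefined && (typeof id !== 'string' || !PROVIDER_ID.test(id))) {
      errors.push({ code: 'invalid_provider_id', field: `provider.${key}` });
    }
  }
  for (const entry of provider.availability ?? []) {
    if (!AVAILABILITY_STATUSES.has(entry.status)) {
      errors.push({ code: 'invalid_availability_status', value: entry.status });
    }
  }

  // Capture artifact refs must point at artifacts the bundle declares.
  const declared = new Set((bundle.artifacts ?? []).map((artifact) => artifact.path));
  (capture.captures ?? []).forEach((item, index) => {
    if (!CAPTURE_MODES.has(item.mode)) {
      errors.push({ code: 'invalid_capture_mode', index });
    }
    for (const field of METRIC_FIELDS) {
      const value = item[field];
      if (value !== undefined && !(Number.isFinite(value) && value >= 0)) {
        errors.push({ code: 'invalid_metric', index, field });
      }
    }
    for (const ref of item.artifact_refs ?? []) {
      if (!declared.has(ref)) {
        errors.push({ code: 'undeclared_artifact_ref', index, ref });
      }
    }
  });
  return errors;
}
```

The sketch checks shape only. It leaves privacy rules to the existing artifact validation path and leaves proof sufficiency to comparison, matching the separation this plan describes.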
+ +**Requirements:** R1, R3, R5, R7 + +**Dependencies:** None + +**Files:** +- Modify: `bin/lib/ui-proof.mjs` +- Test: `tests/phase.test.cjs` + +**Approach:** +- Introduce constants for browser capture mode IDs and budget metric field names. +- Introduce constants for provider availability statuses. +- Reuse existing status values where possible: capture `result` should use the current claim statuses. +- Validate provider IDs with the existing concise tool-ID pattern rather than adding provider-specific schema locks. +- Validate numeric metrics as non-negative finite numbers. +- Validate capture `artifact_refs` against declared `artifacts` when present, matching current observation-to-artifact linkage behavior. +- Validate raw artifact privacy through the existing artifact validation path, not through a second privacy system. +- Permit omitted `runtime_capture` and `runtime_capture_requirements` so existing bundles remain valid. + +**Patterns to follow:** +- Existing `validateUiProofBundle`, `validateUiProofSlots`, `validateArtifacts`, and `validateObservationArtifactRefs` style in `bin/lib/ui-proof.mjs`. +- Existing invalid-metadata tests around UI proof bundles in `tests/phase.test.cjs`. + +**Test scenarios:** +- Happy path: an existing valid proof bundle with no `runtime_capture` remains valid. +- Happy path: a valid bundle with `runtime_capture.provider`, one screenshot capture, and matching artifact refs validates. +- Happy path: a planned slot with `runtime_capture_requirements.required_modes` and budgets validates. +- Edge case: unknown but syntactically valid provider ID validates to preserve provider neutrality. +- Error path: provider ID with spaces or unsupported characters fails validation. +- Error path: capture mode outside the allowed mode list fails validation. +- Error path: provider availability status outside the allowed status list fails validation. 
+- Error path: negative `latency_ms`, `raw_bytes`, `text_bytes`, or `estimated_tokens` fails validation. +- Error path: capture `artifact_refs` points to an undeclared artifact and fails validation. +- Error path: raw screenshot artifact still fails public/release proof validation unless safe-to-publish metadata is valid. + +**Verification:** +- `ui-proof` validation accepts old bundles, accepts valid annotated bundles, and rejects malformed benchmark annotations with actionable error codes. + +- [ ] **Unit 2: Planned-vs-Observed Capture Comparison** + +**Goal:** Make `gsdd ui-proof compare` evaluate planned runtime capture requirements against observed runtime capture metadata when a slot opts in. + +**Requirements:** R3, R5, R6, R7 + +**Dependencies:** Unit 1 + +**Files:** +- Modify: `bin/lib/ui-proof.mjs` +- Test: `tests/phase.test.cjs` + +**Approach:** +- Extend slot comparison only when `runtime_capture_requirements` exists on the planned slot. +- Require each planned `required_modes` value to appear in at least one passed observed capture for the slot. +- Compare declared budgets to aggregated observed metrics for captures linked to the slot. +- Treat missing required modes or budget overruns as comparison issues, producing `partial` unless all other comparison logic already yields `missing`. +- Do not fail solely because the selected provider differs from the preferred provider if the observed bundle records availability, fallback reason, and claim limits. +- Do fail or downgrade when an observed fallback silently omits fallback reason and claim narrowing. +- Keep comparison output compatible with current `compareUiProofSlots` result shape. + +**Patterns to follow:** +- Existing `compareSlotToBundle` issue construction and `decorateComparisonIssue` behavior in `bin/lib/ui-proof.mjs`. +- Existing phase verification tests for missing, partial, and satisfied UI proof comparison. 
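The comparison rules in the Unit 2 approach above can be sketched roughly as follows. This is a hedged illustration, not the actual `compareUiProofSlots` logic: the `compareRuntimeCapture` function, the issue codes, and the mapping from budget keys to metric fields are assumptions made for the example, and the fallback-reason checks are omitted.

```javascript
// Hypothetical mapping from planned budget keys to the observed metric each one sums.
const BUDGET_METRICS = {
  text_bytes_max: 'text_bytes',
  estimated_tokens_max: 'estimated_tokens',
  raw_artifact_bytes_max: 'raw_bytes',
};

function compareRuntimeCapture(slot, bundle) {
  const requirements = slot.runtime_capture_requirements;
  // Slots that did not opt in contribute no capture issues in this sketch.
  if (!requirements) return { status: 'satisfied', issues: [] };

  // Only captures linked to this slot count toward its requirements and budgets.
  const captures = (bundle.runtime_capture?.captures ?? [])
    .filter((capture) => (capture.slot_ids ?? []).includes(slot.slot_id));
  const issues = [];

  // Each required mode must appear in at least one passed capture for the slot.
  const passedModes = new Set(
    captures.filter((capture) => capture.result === 'passed').map((capture) => capture.mode),
  );
  for (const mode of requirements.required_modes ?? []) {
    if (!passedModes.has(mode)) issues.push({ code: 'missing_required_mode', mode });
  }

  // Declared budgets are compared to metrics aggregated across the slot's captures.
  for (const [budget, max] of Object.entries(requirements.budgets ?? {})) {
    const metric = BUDGET_METRICS[budget];
    if (!metric) continue; // budgets not modeled in this sketch are skipped
    const total = captures.reduce((sum, capture) => sum + (capture[metric] ?? 0), 0);
    if (total > max) issues.push({ code: 'budget_exceeded', budget, total, max });
  }

  return { status: issues.length === 0 ? 'satisfied' : 'partial', issues };
}
```

Because budgets are only read from the planned slot, a bundle without budget metadata cannot fail this check unless a plan declares a limit, which is the backward-compatibility property the plan asks for.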
+ +**Test scenarios:** +- Happy path: planned screenshot plus interactive snapshot requirements are satisfied by observed passed captures linked to the slot. +- Happy path: selected `direct-cdp` satisfies an `agent-browser` preference when `agent-browser` is recorded unavailable and the fallback reason is present. +- Edge case: optional capture modes are absent and do not block comparison. +- Edge case: captures for another slot do not satisfy the current slot. +- Error path: required `computed_style` capture is missing and comparison reports `partial`. +- Error path: observed `estimated_tokens` exceeds planned budget and comparison reports a budget issue. +- Error path: fallback provider is used without availability/fallback explanation and comparison records a fallback issue. +- Integration: `gsdd verify ` still blocks phase closure when capture requirements are planned but observed capture metadata is absent or partial. + +**Verification:** +- Planned capture requirements become deterministic comparison inputs without weakening current slot, route/state, viewport, artifact, privacy, and claim-limit checks. + +- [ ] **Unit 3: UI Proof Template and Workflow Guidance** + +**Goal:** Update the user-facing UI proof contract so planners, executors, and verifiers know when and how to collect benchmarked browser evidence. + +**Requirements:** R2, R3, R4, R7, R8, R9 + +**Dependencies:** Unit 1, Unit 2 + +**Files:** +- Modify: `distilled/templates/ui-proof.md` +- Modify: `distilled/workflows/plan.md` +- Modify: `distilled/workflows/execute.md` +- Modify: `distilled/workflows/verify.md` +- Modify: `distilled/workflows/quick.md` +- Modify: `agents/planner.md` +- Modify: `agents/executor.md` +- Modify: `agents/verifier.md` +- Test: `tests/gsdd.guards.test.cjs` + +**Approach:** +- Add a compact optional "Runtime Capture Benchmarks" section to `distilled/templates/ui-proof.md`. 
+- Show benchmark annotations as optional metadata, not as new required top-level fields for every bundle. +- Preserve existing default language: `agent-browser` first, project-native fallback when unavailable, Playwright tests as repeatable regression evidence. +- Add direct-CDP escalation language for selected-element DOM/CSS/computed-style claims without making direct-CDP a required provider. +- State that Chrome DevTools MCP and Playwright MCP are optional only when already configured and scoped to the claim. +- State that `gpt-5.4-high` research/deepening requirements must be proven through runtime model routing before an agent claims such review ran. +- Update installed agent surfaces with the same semantics so generated guidance stays coherent. + +**Patterns to follow:** +- Existing wording in `distilled/templates/ui-proof.md` around default `agent-browser`, no new browser infrastructure, privacy defaults, and deterministic validation. +- Existing guard tests that preserve provider-agnostic validation and `agent-browser` default. + +**Test scenarios:** +- Guard: docs still name `agent-browser` as the default live UI proof path. +- Guard: docs still prohibit adding Playwright, Cypress, Storybook, browser MCP, CI, or visual-regression tooling by default. +- Guard: docs mention direct-CDP only as an escalation/fallback path, not the default. +- Guard: docs require benchmark annotations to stay provider-neutral and budgeted. +- Guard: docs preserve raw artifact privacy defaults. +- Guard: agent role files mirror the updated provider chain and benchmark posture. + +**Verification:** +- The generated guidance tells future agents how to collect screenshots plus targeted snapshots/CSS evidence without turning optional browser providers into default infrastructure. + +- [ ] **Unit 4: Design Record and Evidence Index** + +**Goal:** Record the architectural decision so future changes cannot reinterpret benchmark annotations as a provider lock or new evidence kind. 
+ +**Requirements:** R1, R2, R4, R7, R8 + +**Dependencies:** Unit 3 + +**Files:** +- Modify: `distilled/DESIGN.md` +- Modify: `distilled/EVIDENCE-INDEX.md` +- Test: `tests/gsdd.guards.test.cjs` + +**Approach:** +- Add a design decision extending the existing UI proof decision with browser runtime capture benchmark annotations. +- Record the selected provider chain and the non-goals. +- Record why live direct-CDP implementation is deferred. +- Record that raw artifacts remain local-only by default. +- Record that validator behavior remains metadata-focused and provider-neutral. +- Add evidence-index entries for this plan, fixtures, and current Workspine files. + +**Patterns to follow:** +- Existing decision-entry style in `distilled/DESIGN.md`. +- Existing evidence-index style around UI proof and design decisions. + +**Test scenarios:** +- Guard: design docs preserve fixed evidence kinds. +- Guard: design docs preserve provider-neutral validation. +- Guard: design docs record `agent-browser` primary plus direct-CDP escalation without mandating Chrome DevTools MCP or Playwright MCP. +- Guard: evidence index references the new decision and relevant source files. + +**Verification:** +- The decision record prevents future agents from treating benchmark metadata as permission to add a default sidecar, provider lock, or new evidence kind. + +- [ ] **Unit 5: Local Fixtures and Dogfood Proof Examples** + +**Goal:** Add compact example proof bundles and planned slots that demonstrate benchmark annotations without requiring a real browser provider during tests. + +**Requirements:** R3, R5, R6, R7, R8 + +**Dependencies:** Unit 1, Unit 2, Unit 3 + +**Files:** +- Create: `fixtures/ui-proof/browser-runtime-capture-slots.json` +- Create: `fixtures/ui-proof/browser-runtime-capture-bundle.json` +- Modify: `tests/phase.test.cjs` +- Modify: `tests/gsdd.guards.test.cjs` + +**Approach:** +- Keep fixtures synthetic and local-only. 
+- Include one satisfied `agent-browser` primary example with screenshot and interactive snapshot captures. +- Include one direct-CDP escalation example for selected-element DOM/CSS/computed-style capture, with explicit fallback reason and claim limits. +- Include no raw DOM dump content in fixtures; use metadata and artifact refs only. +- Use existing test helper patterns rather than adding fixture loaders unless local test conventions justify it. + +**Patterns to follow:** +- Existing dogfood UI proof examples embedded in `tests/phase.test.cjs`. +- Existing fixture directory conventions. + +**Test scenarios:** +- Happy path: fixture planned slots and observed bundle compare as satisfied. +- Happy path: direct-CDP fallback fixture is accepted because fallback is explicit and claim-limited. +- Error path: mutated fixture with missing screenshot capture produces partial comparison. +- Error path: mutated fixture with public claim backed by local-only screenshot remains invalid. +- Integration: fixture paths remain workspace-relative and do not require real screenshot files unless validation is explicitly run with local-artifact existence checks. + +**Verification:** +- Future contributors have a compact, deterministic proof example for benchmark annotations and provider fallback behavior. + +- [ ] **Unit 6: CLI Help, Health Messaging, and Backward Compatibility Review** + +**Goal:** Make the new optional metadata discoverable without changing the top-level command surface unless implementation proves a command addition is necessary. 
+ +**Requirements:** R3, R4, R5, R9 + +**Dependencies:** Unit 1 through Unit 5 + +**Files:** +- Modify: `bin/lib/init-runtime.mjs` +- Modify: `bin/lib/rendering.mjs` +- Modify: `bin/lib/health.mjs` +- Modify: `README.md` +- Modify: `docs/USER-GUIDE.md` +- Test: `tests/gsdd.init.test.cjs` +- Test: `tests/gsdd.health.test.cjs` +- Test: `tests/gsdd.guards.test.cjs` + +**Approach:** +- Prefer documenting the feature under the existing `ui-proof validate` and `ui-proof compare` commands. +- Avoid adding `gsdd ui-proof benchmark` unless implementation shows the existing command output cannot surface needed comparison issues cleanly. +- Update health fix hints only where malformed benchmark metadata should be actionable. +- Document that model-pinned subagent review claims require actual runtime routing support; otherwise agents should record reduced assurance. +- Preserve current command usage strings and output compatibility. + +**Patterns to follow:** +- Existing help rendering in `bin/lib/init-runtime.mjs` and `bin/lib/rendering.mjs`. +- Existing health E10 wording around UI proof metadata. + +**Test scenarios:** +- Happy path: `gsdd ui-proof validate` output shape remains compatible for old bundles. +- Happy path: health reports malformed benchmark metadata through existing UI proof metadata failure paths. +- Edge case: project without browser tooling still passes health when no UI proof bundle requires browser capture. +- Guard: help/docs do not imply a new browser provider must be installed. +- Guard: model-pinned research wording does not claim unsupported general subagent routing exists. + +**Verification:** +- Users can discover benchmark annotations through existing UI proof docs and commands without a new browser command becoming an accidental product promise. + +## System-Wide Impact + +- **Interaction graph:** Planned slots feed `gsdd ui-proof compare`; observed bundles feed `gsdd ui-proof validate`, `gsdd ui-proof compare`, `gsdd health`, and `gsdd verify`. 
+- **Error propagation:** Malformed benchmark metadata should surface as validation errors; unmet planned capture requirements should surface as comparison issues and verification blockers where applicable. +- **State lifecycle risks:** Raw artifacts remain local files referenced by metadata; metadata must not inline screenshots, DOM dumps, traces, or sensitive browser state. +- **API surface parity:** Installed workflow docs and agent role docs must match the source `distilled/` contract. +- **Integration coverage:** Tests must cover direct validation, planned-vs-observed comparison, phase verification, health, guard docs, and old-bundle backward compatibility. +- **Unchanged invariants:** Fixed evidence kinds remain unchanged; `agent-browser` remains the default live runtime path; provider-specific tools remain optional; Playwright tests remain repeatable regression evidence, not a replacement for scoped runtime proof. + +## Risks & Dependencies + +| Risk | Mitigation | +| --- | --- | +| Benchmark metadata becomes a provider lock | Validate provider ID syntax, not a hard provider enum; document provider neutrality. | +| Metadata becomes too heavy | Store metrics and artifact refs, not raw DOM, screenshots, traces, or full logs. | +| Direct-CDP scope creeps into V1 | Record direct-CDP as escalation metadata only; defer live provider implementation. | +| Existing bundles break | Keep new fields optional and add old-bundle compatibility tests. | +| Compare output becomes noisy | Only enforce capture requirements when planned slots opt in. | +| Privacy rules fork | Reuse existing artifact privacy validation and public-claim checks. | +| Agents claim `gpt-5.4-high` review without proof | Document reduced assurance unless runtime model routing proves the requested model was used. | +| Docs drift from generated surfaces | Update source docs, agent role docs, and guard tests together. 
| + +## Documentation / Operational Notes + +- This is not a UI-visible feature, so the implementation plan itself should use `no_ui_proof_rationale` if converted into a `.planning` phase plan. +- Execution should be characterization-first around existing UI proof behavior: add tests that lock old valid bundles before adding new optional metadata. +- The implementation should not run or install browser tooling to satisfy tests. +- If implementation discovers that `runtime_capture` is a poor field name, the replacement must preserve the same boundary: optional, provider-neutral, metadata-only, and budgetable. +- Do not claim independent `gpt-5.4-high` research or document review unless a runtime route exposes model selection and records the model used. + +## Plan Review Status + +- `ce:plan` was used to produce this plan from the local goal handoff and repo research. +- Independent `document-review` subagents were not spawned because this runtime did not expose a model-selectable route, and the origin directive requires `gpt-5.4-high` for research/deepening subagents. +- Self-review checked scope boundaries, provider-default consistency, fixed evidence kinds, privacy invariants, benchmark vocabulary, test coverage expectations, repo-relative paths, ASCII encoding, and diff hygiene. +- Residual risk: run a model-pinned independent document-review pass before implementation if a runtime route can prove `gpt-5.4-high` was used. + +## Sources & References + +- Origin capture: local goal handoff retained outside the public PR. 
+- UI proof template: `distilled/templates/ui-proof.md` +- UI proof validator and comparator: `bin/lib/ui-proof.mjs` +- Planner workflow: `distilled/workflows/plan.md` +- Executor workflow: `distilled/workflows/execute.md` +- Verifier workflow: `distilled/workflows/verify.md` +- Quick workflow: `distilled/workflows/quick.md` +- Agent planner role: `agents/planner.md` +- Agent executor role: `agents/executor.md` +- Agent verifier role: `agents/verifier.md` +- Design record: `distilled/DESIGN.md` +- Evidence index: `distilled/EVIDENCE-INDEX.md` +- Model routing config: `bin/lib/models.mjs` +- UI proof tests: `tests/phase.test.cjs` +- Contract guard tests: `tests/gsdd.guards.test.cjs` diff --git a/fixtures/ui-proof/browser-runtime-capture-bundle.json b/fixtures/ui-proof/browser-runtime-capture-bundle.json new file mode 100644 index 0000000..7a401de --- /dev/null +++ b/fixtures/ui-proof/browser-runtime-capture-bundle.json @@ -0,0 +1,203 @@ +{ + "proof_bundle_version": 1, + "scope": { + "work_item": "browser-runtime-capture-fixture", + "requirement_ids": ["UIPROOF-BROWSER-01", "UIPROOF-BROWSER-02"], + "slot_ids": ["ui-browser-agent-primary", "ui-browser-direct-cdp-fallback"], + "claim": "Browser runtime capture metadata can support scoped local UI proof without installing browser infrastructure." 
+ }, + "route_state": { + "route": "/synthetic-browser-proof", + "state": "seeded synthetic browser capture fixture" + }, + "environment": { + "app_url": "http://localhost:3000/synthetic-browser-proof", + "data_state": "synthetic" + }, + "viewport": { + "width": 1280, + "height": 720 + }, + "evidence_inputs": { + "kinds": ["test", "runtime"], + "tools_used": ["agent-browser", "direct-cdp"] + }, + "commands_or_manual_steps": [ + { + "command": "gsdd ui-proof compare fixtures/ui-proof/browser-runtime-capture-slots.json fixtures/ui-proof/browser-runtime-capture-bundle.json", + "result": "passed", + "attempts": 1 + } + ], + "observations": [ + { + "observation": "Rendered state has a screenshot capture linked to a local-only artifact.", + "claim": "Browser runtime capture metadata can support scoped local UI proof without installing browser infrastructure.", + "route_state": { + "route": "/synthetic-browser-proof", + "state": "seeded synthetic browser capture fixture" + }, + "evidence_kind": "runtime", + "artifact_refs": ["fixtures/ui-proof/artifacts/synthetic-browser-1280.png", "fixtures/ui-proof/browser-runtime-capture-bundle.json"], + "privacy": { + "data_classification": "synthetic", + "raw_artifacts_safe_to_publish": false, + "retention": "temporary_review" + }, + "result": "passed", + "claim_limit": "Proves benchmark metadata comparison only; does not prove live browser rendering, cross-browser behavior, or public release proof." 
+ }, + { + "observation": "Interactive snapshot capture stays within the planned budget.", + "claim": "Browser runtime capture metadata can support scoped local UI proof without installing browser infrastructure.", + "route_state": { + "route": "/synthetic-browser-proof", + "state": "seeded synthetic browser capture fixture" + }, + "evidence_kind": "runtime", + "artifact_refs": ["fixtures/ui-proof/browser-runtime-capture-bundle.json"], + "privacy": { + "data_classification": "synthetic", + "raw_artifacts_safe_to_publish": false, + "retention": "temporary_review" + }, + "result": "passed", + "claim_limit": "Proves benchmark metadata comparison only; does not prove live browser rendering, cross-browser behavior, or public release proof." + }, + { + "observation": "Selected element computed-style capture is represented as bounded metadata.", + "claim": "Browser runtime capture metadata can support scoped local UI proof without installing browser infrastructure.", + "route_state": { + "route": "/synthetic-browser-proof", + "state": "seeded synthetic browser capture fixture" + }, + "evidence_kind": "runtime", + "artifact_refs": ["fixtures/ui-proof/browser-runtime-capture-bundle.json"], + "privacy": { + "data_classification": "synthetic", + "raw_artifacts_safe_to_publish": false, + "retention": "temporary_review" + }, + "result": "passed", + "claim_limit": "Proves bounded selected-element CSS metadata only; direct-cdp remains an escalation path, not the default provider." 
+ }, + { + "observation": "Fixture comparison is covered by deterministic node:test assertions.", + "claim": "Browser runtime capture metadata can support scoped local UI proof without installing browser infrastructure.", + "route_state": { + "route": "/synthetic-browser-proof", + "state": "seeded synthetic browser capture fixture" + }, + "evidence_kind": "test", + "artifact_refs": ["fixtures/ui-proof/browser-runtime-capture-bundle.json"], + "privacy": { + "data_classification": "synthetic", + "raw_artifacts_safe_to_publish": false, + "retention": "temporary_review" + }, + "result": "passed", + "claim_limit": "Proves benchmark metadata comparison only; does not prove live browser rendering, cross-browser behavior, or public release proof." + } + ], + "artifacts": [ + { + "path": "fixtures/ui-proof/artifacts/synthetic-browser-1280.png", + "type": "screenshot", + "visibility": "local_only", + "retention": "temporary_review", + "sensitivity": "synthetic", + "safe_to_publish": false, + "notes": "Synthetic local screenshot reference; raw artifact is not tracked and is not public proof." 
+ }, + { + "path": "fixtures/ui-proof/browser-runtime-capture-bundle.json", + "type": "metadata", + "visibility": "local_only", + "retention": "temporary_review", + "sensitivity": "synthetic", + "safe_to_publish": false + } + ], + "runtime_capture": { + "provider": { + "primary": "agent-browser", + "selected": "agent-browser", + "fallback_chain": ["agent-browser", "direct-cdp", "chrome-devtools-mcp", "playwright-mcp", "manual"], + "fallback_reason": null, + "availability": [ + { + "provider": "agent-browser", + "status": "available" + }, + { + "provider": "direct-cdp", + "status": "available" + } + ] + }, + "captures": [ + { + "mode": "screenshot", + "slot_ids": ["ui-browser-agent-primary"], + "artifact_refs": ["fixtures/ui-proof/artifacts/synthetic-browser-1280.png"], + "latency_ms": 420, + "raw_bytes": 184224, + "text_bytes": 0, + "estimated_tokens": 0, + "token_estimate_method": "not_applicable", + "result": "passed" + }, + { + "mode": "interactive_snapshot", + "slot_ids": ["ui-browser-agent-primary"], + "latency_ms": 180, + "raw_bytes": 0, + "text_bytes": 2200, + "estimated_tokens": 550, + "token_estimate_method": "rough_char_div_4", + "result": "passed" + }, + { + "mode": "computed_style", + "provider": "direct-cdp", + "slot_ids": ["ui-browser-direct-cdp-fallback"], + "latency_ms": 95, + "raw_bytes": 0, + "text_bytes": 1800, + "estimated_tokens": 450, + "computed_style_properties": 34, + "token_estimate_method": "rough_char_div_4", + "result": "passed" + } + ], + "fidelity": { + "sees_pixels": true, + "includes_accessibility_tree": true, + "includes_dom_subset": false, + "includes_computed_styles": true, + "includes_framework_state": false, + "claim_limits": [ + "No raw DOM dump is included.", + "Computed-style evidence is bounded to the selected element." 
+ ] + } + }, + "privacy": { + "data_classification": "synthetic", + "redactions": [], + "raw_artifacts_safe_to_publish": false, + "retention": "Keep metadata bundle; keep raw artifacts only while needed for local review." + }, + "result": { + "claim_status": "passed", + "comparison_status_by_slot": { + "ui-browser-agent-primary": "satisfied", + "ui-browser-direct-cdp-fallback": "satisfied" + }, + "failure_classification": null + }, + "claim_limits": [ + "Proves benchmark metadata comparison only; does not prove live browser rendering, cross-browser behavior, or public release proof.", + "Proves bounded selected-element CSS metadata only; direct-cdp remains an escalation path, not the default provider." + ] +} diff --git a/fixtures/ui-proof/browser-runtime-capture-slots.json b/fixtures/ui-proof/browser-runtime-capture-slots.json new file mode 100644 index 0000000..820ba7f --- /dev/null +++ b/fixtures/ui-proof/browser-runtime-capture-slots.json @@ -0,0 +1,78 @@ +{ + "ui_proof_slots": [ + { + "slot_id": "ui-browser-agent-primary", + "requirement_id": "UIPROOF-BROWSER-01", + "claim": "Browser runtime capture metadata can support scoped local UI proof without installing browser infrastructure.", + "route_state": { + "route": "/synthetic-browser-proof", + "state": "seeded synthetic browser capture fixture" + }, + "required_evidence_kinds": ["test", "runtime"], + "minimum_observations": [ + "Rendered state has a screenshot capture linked to a local-only artifact.", + "Interactive snapshot capture stays within the planned budget." 
+ ], + "expected_artifact_types": ["screenshot", "metadata"], + "validation_command": "gsdd ui-proof compare fixtures/ui-proof/browser-runtime-capture-slots.json fixtures/ui-proof/browser-runtime-capture-bundle.json", + "environment": { + "app_url": "http://localhost:3000/synthetic-browser-proof", + "data_state": "synthetic" + }, + "viewport": { + "width": 1280, + "height": 720 + }, + "manual_acceptance_required": false, + "claim_limit": "Proves benchmark metadata comparison only; does not prove live browser rendering, cross-browser behavior, or public release proof.", + "runtime_capture_requirements": { + "provider_preference": ["agent-browser"], + "fallback_policy": "record_availability_and_narrow_claim", + "required_modes": ["screenshot", "interactive_snapshot"], + "optional_modes": ["console_delta", "network_delta"], + "budgets": { + "text_bytes_max": 24000, + "estimated_tokens_max": 6000, + "raw_artifact_bytes_max": 5000000, + "screenshot_count_max": 4 + } + } + }, + { + "slot_id": "ui-browser-direct-cdp-fallback", + "requirement_id": "UIPROOF-BROWSER-02", + "claim": "Browser runtime capture metadata can support scoped local UI proof without installing browser infrastructure.", + "route_state": { + "route": "/synthetic-browser-proof", + "state": "seeded synthetic browser capture fixture" + }, + "required_evidence_kinds": ["test", "runtime"], + "minimum_observations": [ + "Selected element computed-style capture is represented as bounded metadata." 
+ ], + "expected_artifact_types": ["screenshot", "metadata"], + "validation_command": "gsdd ui-proof compare fixtures/ui-proof/browser-runtime-capture-slots.json fixtures/ui-proof/browser-runtime-capture-bundle.json", + "environment": { + "app_url": "http://localhost:3000/synthetic-browser-proof", + "data_state": "synthetic" + }, + "viewport": { + "width": 1280, + "height": 720 + }, + "manual_acceptance_required": false, + "claim_limit": "Proves bounded selected-element CSS metadata only; direct-cdp remains an escalation path, not the default provider.", + "runtime_capture_requirements": { + "provider_preference": ["agent-browser", "direct-cdp"], + "fallback_policy": "record_availability_and_narrow_claim", + "required_modes": ["computed_style"], + "optional_modes": ["selected_element_dom"], + "budgets": { + "text_bytes_max": 24000, + "estimated_tokens_max": 6000, + "computed_style_properties_max": 80 + } + } + } + ] +} diff --git a/tests/gsdd.guards.test.cjs b/tests/gsdd.guards.test.cjs index 27ff711..feb7054 100644 --- a/tests/gsdd.guards.test.cjs +++ b/tests/gsdd.guards.test.cjs @@ -3507,6 +3507,13 @@ describe('G55 - UI Proof Contract', () => { const verifierRole = fs.readFileSync(path.join(ROOT, 'agents', 'verifier.md'), 'utf-8'); const planChecker = fs.readFileSync(path.join(ROOT, 'distilled', 'templates', 'delegates', 'plan-checker.md'), 'utf-8'); const uiProofSource = fs.readFileSync(path.join(ROOT, 'bin', 'lib', 'ui-proof.mjs'), 'utf-8'); + const designRecord = fs.readFileSync(path.join(ROOT, 'distilled', 'DESIGN.md'), 'utf-8'); + const evidenceIndex = fs.readFileSync(path.join(ROOT, 'distilled', 'EVIDENCE-INDEX.md'), 'utf-8'); + const readme = fs.readFileSync(path.join(ROOT, 'README.md'), 'utf-8'); + const userGuide = fs.readFileSync(path.join(ROOT, 'docs', 'USER-GUIDE.md'), 'utf-8'); + const initRuntime = fs.readFileSync(path.join(ROOT, 'bin', 'lib', 'init-runtime.mjs'), 'utf-8'); + const rendering = fs.readFileSync(path.join(ROOT, 'bin', 'lib', 
'rendering.mjs'), 'utf-8'); + const healthSource = fs.readFileSync(path.join(ROOT, 'bin', 'lib', 'health.mjs'), 'utf-8'); function parseObservedBundleExample() { const match = template.match(/```json\s*\n([\s\S]*?)\n```/); @@ -3538,6 +3545,8 @@ describe('G55 - UI Proof Contract', () => { 'result', 'claim_status', 'claim_limits', + 'runtime_capture_requirements', + 'runtime_capture', ]) { assert.match(template, new RegExp(token), `ui-proof.md must include ${token}. FIX: Restore the locked UI proof schema field.`); } @@ -3622,6 +3631,102 @@ describe('G55 - UI Proof Contract', () => { 'bin/lib/ui-proof.mjs must remain provider-agnostic metadata validation, not an agent-browser schema gate.'); }); + test('runtime capture benchmarks stay optional provider-neutral and budgeted', () => { + const combined = [ + template, + planContent, + executeContent, + quickContent, + verifyContent, + plannerRole, + executorRole, + verifierRole, + planChecker, + designRecord, + evidenceIndex, + readme, + userGuide, + initRuntime, + rendering, + healthSource + ].join('\n'); + + assert.match(template, /runtime_capture_requirements/i, + 'ui-proof.md must document planned runtime capture requirements.'); + assert.match(template, /runtime_capture/i, + 'ui-proof.md must document observed runtime capture metadata.'); + assert.match(combined, /provider-neutral|provider-agnostic/i, + 'Runtime capture annotations must remain provider-neutral metadata.'); + assert.match(combined, /metadata-only|metadata-focused|does not inspect raw/i, + 'Runtime capture validation must stay metadata-focused.'); + assert.match(combined, /agent-browser[\s\S]{0,220}(default|first)/i, + 'Runtime capture guidance must keep agent-browser as the default/first live UI proof path.'); + assert.match(combined, /(direct-CDP|direct-cdp)[\s\S]{0,220}escalation/i, + 'Runtime capture guidance must describe direct-CDP as escalation, not the default.'); + assert.match(combined, /Chrome DevTools MCP[\s\S]{0,220}Playwright 
MCP[\s\S]{0,220}optional only when already configured/i, + 'Runtime capture guidance must keep Chrome DevTools MCP and Playwright MCP optional and preconfigured.'); + assert.match(combined, /Do not (add|plan|install|scaffold)[\s\S]{0,240}(browser tooling|browser installs|browser MCP|CI|Storybook|visual-regression)/i, + 'Runtime capture guidance must not introduce default browser infrastructure.'); + assert.match(combined, /raw screenshots[\s\S]{0,260}(local-only|local_only|safe_to_publish: false)/i, + 'Runtime capture guidance must preserve raw artifact local-only privacy defaults.'); + assert.match(combined, /gpt-5\.4-high[\s\S]{0,180}(runtime model-routing evidence|model-routing evidence|prove)/i, + 'Runtime capture planning must not claim model-pinned research without runtime routing proof.'); + + for (const mode of [ + 'screenshot', + 'interactive_snapshot', + 'accessibility_snapshot', + 'dom_subset', + 'selected_element_dom', + 'computed_style', + 'console_delta', + 'network_delta', + 'framework_state', + 'manual_observation' + ]) { + assert.match(template, new RegExp('`' + mode + '`'), `ui-proof.md must document runtime capture mode ${mode}.`); + assert.match(uiProofSource, new RegExp(`'${mode}'`), `ui-proof validator must define runtime capture mode ${mode}.`); + } + + for (const status of ['available', 'unavailable', 'not_configured', 'skipped', 'failed']) { + assert.match(template, new RegExp('`' + status + '`'), `ui-proof.md must document runtime capture provider status ${status}.`); + assert.match(uiProofSource, new RegExp(`'${status}'`), `ui-proof validator must define runtime capture provider status ${status}.`); + } + + for (const budget of [ + 'text_bytes_max', + 'estimated_tokens_max', + 'raw_artifact_bytes_max', + 'screenshot_count_max', + 'computed_style_properties_max', + 'console_event_count_max', + 'network_event_count_max' + ]) { + assert.match(template, new RegExp('`' + budget + '`'), `ui-proof.md must document runtime capture budget 
${budget}.`); + assert.match(uiProofSource, new RegExp(budget), `ui-proof validator must define runtime capture budget ${budget}.`); + } + + for (const exportName of [ + 'UI_PROOF_RUNTIME_CAPTURE_AVAILABILITY_STATUSES', + 'UI_PROOF_RUNTIME_CAPTURE_BUDGET_FIELD_MAP', + 'UI_PROOF_RUNTIME_CAPTURE_METRIC_FIELDS', + 'UI_PROOF_RUNTIME_CAPTURE_MODES' + ]) { + assert.match(uiProofSource, new RegExp(exportName), `ui-proof.mjs must export ${exportName}.`); + } + + assert.match(healthSource, /runtime capture benchmark fields when present/i, + 'health E10 repair guidance must mention optional runtime capture metadata.'); + assert.match(initRuntime, /runtime capture annotations/i, + 'init help must mention optional runtime capture annotations.'); + assert.match(rendering, /runtime capture annotations/i, + 'rendered helper help must mention optional runtime capture annotations.'); + assert.match(readme, /without installing browser tooling by default/i, + 'README must keep runtime capture annotations non-infrastructural.'); + assert.match(userGuide, /agent-browser` remains the default live UI proof path/i, + 'User guide must keep agent-browser default wording.'); + }); + test('observed bundle example keeps runtime artifacts traceable', () => { const bundle = parseObservedBundleExample(); const declaredRefs = new Set(bundle.artifacts.map((artifact) => artifact.path || artifact.url)); diff --git a/tests/phase.test.cjs b/tests/phase.test.cjs index 246f9e8..c338acc 100644 --- a/tests/phase.test.cjs +++ b/tests/phase.test.cjs @@ -2699,6 +2699,93 @@ describe('Phase 57 UI proof validation helper', () => { assert.strictEqual(result.valid, true, JSON.stringify(result.errors)); }); + test('optional runtime capture benchmark metadata validates when provider neutral and artifact-linked', async () => { + const mod = await importUiProofModule(); + const bundle = validBundle({ + evidence_inputs: { kinds: ['test', 'runtime'], tools_used: ['agent-browser'] }, + }); + bundle.artifacts.push({ + 
path: 'artifacts/example-1280.png', + type: 'screenshot', + visibility: 'local_only', + retention: 'temporary_review', + sensitivity: 'synthetic', + safe_to_publish: false, + }); + bundle.observations[0].artifact_refs.push('artifacts/example-1280.png'); + bundle.runtime_capture = { + provider: { + primary: 'agent-browser', + selected: 'agent-browser', + fallback_chain: ['agent-browser', 'direct-cdp'], + availability: [{ provider: 'agent-browser', status: 'available' }], + }, + captures: [{ + mode: 'screenshot', + slot_ids: ['quick-001-ui-01'], + artifact_refs: ['artifacts/example-1280.png'], + latency_ms: 420, + raw_bytes: 184224, + text_bytes: 0, + estimated_tokens: 0, + token_estimate_method: 'not_applicable', + result: 'passed', + }, { + mode: 'interactive_snapshot', + slot_ids: ['quick-001-ui-01'], + latency_ms: 180, + text_bytes: 2200, + estimated_tokens: 550, + token_estimate_method: 'rough_char_div_4', + result: 'passed', + }], + fidelity: { + sees_pixels: true, + includes_accessibility_tree: true, + includes_dom_subset: false, + includes_computed_styles: false, + includes_framework_state: false, + claim_limits: ['No selected-element computed style capture was required.'], + }, + }; + + const result = mod.validateUiProofBundle(bundle); + assert.strictEqual(result.valid, true, JSON.stringify(result.errors)); + }); + + test('runtime capture benchmark metadata rejects malformed modes provider status metrics and artifact refs', async () => { + const mod = await importUiProofModule(); + const bundle = validBundle({ + runtime_capture: { + provider: { + primary: 'agent browser', + selected: 'agent-browser', + fallback_chain: ['agent-browser'], + availability: [{ provider: 'agent-browser', status: 'maybe' }], + }, + captures: [{ + mode: 'full_dom_dump', + slot_ids: ['quick-001-ui-01'], + artifact_refs: ['artifacts/missing.png'], + latency_ms: -1, + raw_bytes: 1, + text_bytes: 1, + estimated_tokens: 1, + result: 'ok', + }], + }, + }); + + const result = 
mod.validateUiProofBundle(bundle); + assert.strictEqual(result.valid, false); + assert.ok(result.errors.some((error) => error.code === 'invalid_runtime_capture_provider_id')); + assert.ok(result.errors.some((error) => error.code === 'invalid_runtime_capture_availability_status')); + assert.ok(result.errors.some((error) => error.code === 'unsupported_runtime_capture_mode')); + assert.ok(result.errors.some((error) => error.code === 'invalid_runtime_capture_metric')); + assert.ok(result.errors.some((error) => error.code === 'invalid_runtime_capture_result')); + assert.ok(result.errors.some((error) => error.code === 'unknown_runtime_capture_artifact_ref')); + }); + test('fenced JSON in markdown parses but YAML-only bundles fail', async () => { const mod = await importUiProofModule(); const bundle = validBundle(); @@ -2730,6 +2817,53 @@ describe('Phase 57 UI proof validation helper', () => { assert.ok(result.errors.some((error) => error.path === 'artifacts[0].safe_to_publish')); }); + test('planned runtime capture requirements are optional but validated when present', async () => { + const mod = await importUiProofModule(); + const baseSlot = { + slot_id: 'quick-001-ui-01', + claim: 'Local reviewer can inspect the changed UI proof metadata.', + route_state: { route: '/example', state: 'synthetic user' }, + required_evidence_kinds: ['runtime'], + minimum_observations: ['Changed state is visible.'], + expected_artifact_types: ['screenshot'], + validation_command: 'gsdd ui-proof compare .planning/ui-proof-slots.json .planning/ui-proof.json', + environment: { app_url: 'http://localhost:3000', data_state: 'synthetic' }, + viewport: { width: 1280, height: 720 }, + manual_acceptance_required: false, + claim_limit: 'Does not prove unrelated UI states.', + }; + + const valid = mod.validateUiProofSlots([{ + ...baseSlot, + runtime_capture_requirements: { + provider_preference: ['agent-browser', 'direct-cdp'], + fallback_policy: 'record_availability_and_narrow_claim', + 
required_modes: ['screenshot', 'interactive_snapshot'], + optional_modes: ['computed_style'], + budgets: { + text_bytes_max: 24000, + estimated_tokens_max: 6000, + raw_artifact_bytes_max: 5000000, + screenshot_count_max: 4, + }, + }, + }]); + assert.strictEqual(valid.valid, true, JSON.stringify(valid.errors)); + + const invalid = mod.validateUiProofSlots([{ + ...baseSlot, + runtime_capture_requirements: { + provider_preference: ['agent browser'], + required_modes: ['full_dom_dump'], + budgets: { estimated_tokens_max: -1 }, + }, + }]); + assert.strictEqual(invalid.valid, false); + assert.ok(invalid.errors.some((error) => error.code === 'invalid_runtime_capture_provider_id')); + assert.ok(invalid.errors.some((error) => error.code === 'unsupported_runtime_capture_mode')); + assert.ok(invalid.errors.some((error) => error.code === 'invalid_runtime_capture_budget')); + }); + test('tool provenance must use concise tool identifiers', async () => { const mod = await importUiProofModule(); const missingTools = validBundle({ evidence_inputs: { kinds: ['test', 'runtime'] } }); @@ -3210,6 +3344,85 @@ describe('Phase 58 dogfood and Phase 59 UI proof product comparison', () => { return slotsPath; } + function runtimeCaptureSlot(overrides = {}) { + return { + ...plannedSlots()[0], + expected_artifact_types: ['source', 'metadata', 'screenshot'], + runtime_capture_requirements: { + provider_preference: ['agent-browser'], + fallback_policy: 'record_availability_and_narrow_claim', + required_modes: ['screenshot', 'interactive_snapshot'], + optional_modes: ['computed_style'], + budgets: { + text_bytes_max: 24000, + estimated_tokens_max: 6000, + raw_artifact_bytes_max: 5000000, + screenshot_count_max: 4, + }, + }, + ...overrides, + }; + } + + function runtimeCaptureBundle(overrides = {}) { + const bundle = dogfoodBundle(); + const screenshotPath = '.planning/phases/58-dogfood-ui-proof-loop/artifacts/dogfood-1280.png'; + bundle.evidence_inputs = { kinds: ['code', 'test', 'runtime'], 
tools_used: ['node:test', 'agent-browser'] }; + bundle.artifacts = [ + ...bundle.artifacts, + { + path: screenshotPath, + type: 'screenshot', + visibility: 'local_only', + retention: 'temporary_review', + sensitivity: 'synthetic', + safe_to_publish: false, + }, + ]; + bundle.observations = bundle.observations.map((observation) => ({ + ...observation, + artifact_refs: [...observation.artifact_refs, screenshotPath], + })); + bundle.runtime_capture = { + provider: { + primary: 'agent-browser', + selected: 'agent-browser', + fallback_chain: ['agent-browser', 'direct-cdp', 'chrome-devtools-mcp', 'playwright-mcp', 'manual'], + fallback_reason: null, + availability: [{ provider: 'agent-browser', status: 'available' }], + }, + captures: [{ + mode: 'screenshot', + slot_ids: ['ui-58-valid-scoped-proof'], + artifact_refs: [screenshotPath], + latency_ms: 420, + raw_bytes: 184224, + text_bytes: 0, + estimated_tokens: 0, + token_estimate_method: 'not_applicable', + result: 'passed', + }, { + mode: 'interactive_snapshot', + slot_ids: ['ui-58-valid-scoped-proof'], + latency_ms: 180, + raw_bytes: 0, + text_bytes: 2200, + estimated_tokens: 550, + token_estimate_method: 'rough_char_div_4', + result: 'passed', + }], + fidelity: { + sees_pixels: true, + includes_accessibility_tree: true, + includes_dom_subset: false, + includes_computed_styles: false, + includes_framework_state: false, + claim_limits: ['No selected-element computed style capture was required for this slot.'], + }, + }; + return { ...bundle, ...overrides }; + } + test('planned-vs-observed comparison satisfies valid scoped proof and fails closed on missing proof', async () => { const mod = await importUiProofModule(); const slots = plannedSlots(); @@ -3226,6 +3439,83 @@ describe('Phase 58 dogfood and Phase 59 UI proof product comparison', () => { assert.match(missingIssue.fix_hint, /observed UI proof bundle/); }); + test('planned runtime capture requirements compare against observed benchmark captures', async () => { + 
const mod = await importUiProofModule(); + const result = mod.compareUiProofSlots([runtimeCaptureSlot()], [runtimeCaptureBundle()]); + + assert.strictEqual(result.status, 'satisfied', JSON.stringify(result)); + assert.strictEqual(result.slots[0].status, 'satisfied'); + }); + + test('direct-cdp fallback can satisfy runtime capture when availability and claim narrowing are explicit', async () => { + const mod = await importUiProofModule(); + const result = mod.compareUiProofSlots([runtimeCaptureSlot()], [runtimeCaptureBundle({ + evidence_inputs: { kinds: ['code', 'test', 'runtime'], tools_used: ['node:test', 'direct-cdp'] }, + runtime_capture: { + ...runtimeCaptureBundle().runtime_capture, + provider: { + primary: 'agent-browser', + selected: 'direct-cdp', + fallback_chain: ['agent-browser', 'direct-cdp'], + fallback_reason: 'agent-browser unavailable in this runtime; direct-cdp attached to an approved local browser for scoped proof.', + availability: [ + { provider: 'agent-browser', status: 'unavailable' }, + { provider: 'direct-cdp', status: 'available' }, + ], + }, + }, + claim_limits: [ + ...runtimeCaptureBundle().claim_limits, + 'direct-cdp fallback proves scoped local runtime capture only; it does not make direct-cdp the default provider.', + ], + })]); + + assert.strictEqual(result.status, 'satisfied', JSON.stringify(result)); + }); + + test('runtime capture comparison reports missing required modes budget overflow and unexplained fallback', async () => { + const mod = await importUiProofModule(); + const observed = runtimeCaptureBundle({ + runtime_capture: { + ...runtimeCaptureBundle().runtime_capture, + provider: { + primary: 'agent-browser', + selected: 'direct-cdp', + fallback_chain: ['agent-browser', 'direct-cdp'], + availability: [{ provider: 'direct-cdp', status: 'available' }], + }, + captures: [{ + mode: 'screenshot', + slot_ids: ['ui-58-valid-scoped-proof'], + artifact_refs: ['.planning/phases/58-dogfood-ui-proof-loop/artifacts/dogfood-1280.png'], + 
latency_ms: 420, + raw_bytes: 184224, + text_bytes: 26000, + estimated_tokens: 7000, + result: 'passed', + }], + }, + }); + + const result = mod.compareUiProofSlots([runtimeCaptureSlot({ + runtime_capture_requirements: { + ...runtimeCaptureSlot().runtime_capture_requirements, + required_modes: ['screenshot', 'interactive_snapshot', 'computed_style'], + budgets: { + text_bytes_max: 24000, + estimated_tokens_max: 6000, + raw_artifact_bytes_max: 5000000, + }, + }, + })], [observed]); + + assert.strictEqual(result.status, 'partial'); + assert.ok(result.slots[0].issues.some((issue) => issue.code === 'missing_runtime_capture_mode' && /interactive_snapshot/.test(issue.message))); + assert.ok(result.slots[0].issues.some((issue) => issue.code === 'missing_runtime_capture_mode' && /computed_style/.test(issue.message))); + assert.ok(result.slots[0].issues.some((issue) => issue.code === 'runtime_capture_budget_exceeded')); + assert.ok(result.slots[0].issues.some((issue) => issue.code === 'runtime_capture_fallback_missing_reason')); + }); + test('planned-vs-observed comparison fails closed on weak planned slots', async () => { const mod = await importUiProofModule(); const result = mod.compareUiProofSlots([{ slot_id: 'ui-58-valid-scoped-proof' }], [dogfoodBundle()]); @@ -3450,6 +3740,30 @@ describe('Phase 58 dogfood and Phase 59 UI proof product comparison', () => { assert.deepStrictEqual(output.slots.map((slot) => [slot.slot_id, slot.status]), [['ui-58-valid-scoped-proof', 'satisfied']]); }); + test('fixture runtime capture slots and bundle compare as satisfied without live browser tooling', async () => { + const mod = await importUiProofModule(); + const slots = JSON.parse(fs.readFileSync(path.join(__dirname, '..', 'fixtures', 'ui-proof', 'browser-runtime-capture-slots.json'), 'utf-8')); + const bundle = JSON.parse(fs.readFileSync(path.join(__dirname, '..', 'fixtures', 'ui-proof', 'browser-runtime-capture-bundle.json'), 'utf-8')); + + const result = 
mod.compareUiProofSlots(slots.ui_proof_slots, [bundle]); + assert.strictEqual(result.status, 'satisfied', JSON.stringify(result)); + assert.deepStrictEqual(result.slots.map((slot) => [slot.slot_id, slot.status]), [ + ['ui-browser-agent-primary', 'satisfied'], + ['ui-browser-direct-cdp-fallback', 'satisfied'], + ]); + }); + + test('fixture runtime capture comparison fails when required screenshot capture is absent', async () => { + const mod = await importUiProofModule(); + const slots = JSON.parse(fs.readFileSync(path.join(__dirname, '..', 'fixtures', 'ui-proof', 'browser-runtime-capture-slots.json'), 'utf-8')); + const bundle = JSON.parse(fs.readFileSync(path.join(__dirname, '..', 'fixtures', 'ui-proof', 'browser-runtime-capture-bundle.json'), 'utf-8')); + bundle.runtime_capture.captures = bundle.runtime_capture.captures.filter((capture) => capture.mode !== 'screenshot'); + + const result = mod.compareUiProofSlots([slots.ui_proof_slots[0]], [bundle]); + assert.strictEqual(result.status, 'partial'); + assert.ok(result.slots[0].issues.some((issue) => issue.code === 'missing_runtime_capture_mode')); + }); + test('Phase 59 ui-proof compare command rejects weak planned slots deterministically', async () => { await runCliAsMain(tmpDir, ['init', '--auto', '--tools', 'agents']); writePlannedSlots([{ slot_id: 'ui-58-valid-scoped-proof' }]); @@ -3809,6 +4123,28 @@ describe('Phase 58 dogfood and Phase 59 UI proof product comparison', () => { assert.deepStrictEqual(output.uiProof.observed, ['.planning/phases/01-ui-proof/proof-bundle.json']); }); + test('phase verify blocks when planned runtime capture requirements lack observed capture metadata', async () => { + await runCliAsMain(tmpDir, ['init', '--auto', '--tools', 'agents']); + const phaseDir = path.join(tmpDir, '.planning', 'phases', '01-ui-proof-capture'); + fs.mkdirSync(phaseDir, { recursive: true }); + fs.writeFileSync(path.join(phaseDir, '01-PLAN.md'), '---\nui_proof_slots:\n - slot_id: 
ui-58-valid-scoped-proof\n---\n# Phase 1 Plan\n'); + fs.writeFileSync(path.join(phaseDir, '01-SUMMARY.md'), '# Phase 1 Summary\n'); + fs.writeFileSync(path.join(phaseDir, 'ui-proof-slots.json'), JSON.stringify({ ui_proof_slots: [runtimeCaptureSlot()] }, null, 2)); + writeDogfoodFixture(); + fs.copyFileSync( + path.join(tmpDir, '.planning', 'phases', '58-dogfood-ui-proof-loop', 'proof-bundle.json'), + path.join(phaseDir, 'proof-bundle.json') + ); + + const result = await runCliAsMain(tmpDir, ['verify', '1']); + assert.strictEqual(result.exitCode, 1, result.output); + const output = JSON.parse(result.output); + assert.strictEqual(output.verified, false); + assert.deepStrictEqual(output.blocked_on, ['ui_proof']); + assert.strictEqual(output.uiProof.status, 'partial'); + assert.ok(output.uiProof.comparison.slots[0].issues.some((issue) => issue.code === 'missing_runtime_capture')); + }); + test('phase verify ignores stale UI proof sidecars when the plan records no-UI rationale', async () => { await runCliAsMain(tmpDir, ['init', '--auto', '--tools', 'agents']); const phaseDir = path.join(tmpDir, '.planning', 'phases', '01-no-ui-proof');