GraphDone · mvalancy · Jun 16, 2026 · Jun 14, 2026 · Jun 14, 2026 · Jun 14, 2026
diff --git a/.env.test.example b/.env.test.example
@@ -0,0 +1,42 @@
+# GraphDone test-pipeline configuration — LOCAL ONLY.
+#
+# Copy this file to `.env.test.local` (which is gitignored) and fill in your
+# own values. NEVER put real hostnames, IPs, or keys in this committed example
+# or anywhere else in the repo — the local VLM boxes (GPU workstations) must
+# stay out of version control. The test harness auto-loads `.env.test.local`.
+#
+#   cp .env.test.example .env.test.local   # then edit .env.test.local
+
+# --- Local Vision-Language-Model (VLM) endpoints ---------------------------
+# Comma-separated base URLs of your local VLM server(s). Requests are
+# round-robined across them so visual evaluation is spread over every GPU.
+# Leave blank to skip all VLM-driven suites (they no-op cleanly in CI).
+# Example shape (use your OWN hosts in .env.test.local, never here):
+#   VLM_ENDPOINTS=http://<gpu-host-a>:<port>,http://<gpu-host-b>:<port>
+VLM_ENDPOINTS=
+
+# Model id/tag to request (e.g. a llava / qwen2-vl / llama-3.2-vision build).
+VLM_MODEL=
+
+# Optional bearer key for OpenAI-compatible servers that require one.
+VLM_API_KEY=
+
+# Wire protocol: auto (default) | openai | ollama.
+#   auto   — probe each endpoint: /v1/models => OpenAI-compatible, else Ollama.
+#   openai — POST /v1/chat/completions (vLLM, LM Studio, llama.cpp, Ollama compat)
+#   ollama — POST /api/chat with an images[] array
+VLM_PROTOCOL=auto
+
+# Max concurrent VLM requests across all endpoints (default 3).
+VLM_MAX_CONCURRENCY=3
+
+# Per-request timeout in ms — VLMs can be slow on large images (default 120000).
+VLM_TIMEOUT_MS=120000
+
+# --- Large-scale performance sweep -----------------------------------------
+# Node counts to sweep, comma-separated. Leave blank to use the built-in
+# default (small in CI, large locally). Example: 50,200,500,1000,2000
+SCALE_SWEEP_SIZES=
+
+# Quality tiers to sweep per size (subset of LOW,MEDIUM,HIGH,ULTRA).
+SCALE_SWEEP_QUALITIES=HIGH,ULTRA
diff --git a/docs/SYSTEMS.md b/docs/SYSTEMS.md
@@ -19,6 +19,8 @@
 | Lint | `npm run lint` | 0 errors (warnings allowed) |
 | Build | `npm run build` | Production build succeeds |
 | Showcase report | `TEST_URL=http://localhost:3127 npm run report:showcase` | Records .webm video + screenshots of every mode at all 5 resolutions → `test-artifacts/showcase/index.html` (also an every-PR CI artifact). |
+| Large-scale perf sweep | `TEST_URL=http://localhost:3127 npm run test:perf:scale` | Seeds graphs of increasing size (50→2000+ nodes) and records `window.__graphPerf` (settle, tick, fps, drift, query p95) across size × quality → `test-artifacts/scale-sweep/index.html`. Report-only; sizes/qualities via `.env.test.local`. See [docs/testing/local-vlm-and-scale.md](./testing/local-vlm-and-scale.md). |
+| Local VLM visual review | `TEST_URL=http://localhost:3127 npm run test:vlm` | A locally-hosted vision model judges captured states from 4 perspectives (visual defects, new-user clarity, accessibility, living-graph aliveness) → `test-artifacts/vlm/index.html`. **Skips unless `VLM_ENDPOINTS` is set in the gitignored `.env.test.local`** (CI can't reach local GPUs). Report-only. |
 
 **Why THE GATE exists:** a real incident — orphaned `Edge` records made the
 edges query 500 and the UI showed "Error" with zero edges, while every unit

diff --git a/docs/testing/local-vlm-and-scale.md b/docs/testing/local-vlm-and-scale.md
@@ -0,0 +1,113 @@
+# Local VLM visual review & large-scale performance sweeps
+
+Two heavier, report-only suites that exercise GraphDone from realistic user
+perspectives and at scale. Both are **opt-in and run locally** (or on a
+self-hosted runner) because the vision models live on your own GPU boxes —
+their hostnames must never enter the repo.
+
+## TL;DR
+
+```bash
+cp .env.test.example .env.test.local     # gitignored — put your real values here
+# edit .env.test.local: VLM_ENDPOINTS, VLM_MODEL, (optional) sweep sizes
+
+./start dev                               # or have the stack running on :3127
+
+TEST_URL=http://localhost:3127 npm run test:perf:scale   # → test-artifacts/scale-sweep/index.html
+TEST_URL=http://localhost:3127 npm run test:vlm          # → test-artifacts/vlm/index.html
+```
+
+If `VLM_ENDPOINTS` is unset, `test:vlm` **skips cleanly** — so CI and other
+developers are never blocked by hardware they don't have.
+
+## Keeping hostnames out of the repo
+
+- **Never** commit hostnames, IPs, or keys. The GPU boxes (e.g. an RTX 4090
+  workstation and Grace-Blackwell nodes) are referenced only by env vars.
+- `.env.test.local` is gitignored (see `.gitignore`). It is the *only* place
+  your real endpoints live.
+- `.env.test.example` is committed and documents the variable **names** with
+  placeholder hosts (`http://<gpu-host>:<port>`). Copy it to `.env.test.local`
+  and fill in the rest.
+- The harness auto-loads `.env.test.local` via `tests/helpers/testEnv.ts`.
+
+```bash
+# .env.test.local  (NOT committed)
+VLM_ENDPOINTS=http://<host-a>:<port>,http://<host-b>:<port>,http://<host-c>:<port>
+VLM_MODEL=<your-vision-model-tag>
+VLM_PROTOCOL=auto        # auto | openai | ollama
+VLM_MAX_CONCURRENCY=3
+```
+
+Multiple endpoints are **round-robined**, so visual evaluation spreads across
+every GPU you list.
+
+## VLM protocol support
+
+`tests/helpers/vlm.ts` is protocol-agnostic and auto-detects per endpoint:
+
+| Protocol | Detected via | Request |
+|----------|--------------|---------|
+| OpenAI-compatible | `GET /v1/models` | `POST /v1/chat/completions` with an `image_url` data URI (vLLM, LM Studio, llama.cpp server, Ollama's `/v1` shim) |
+| Ollama native | `GET /api/tags` | `POST /api/chat` with a base64 `images[]` array |
+
+Force one with `VLM_PROTOCOL=openai` or `ollama`. Each model call asks for a
+strict JSON verdict `{pass, score, issues[], summary}`, parsed leniently.
+
+### Personas
+
+Each captured screenshot is judged from several perspectives (see `PERSONAS`
+in `tests/helpers/vlm.ts`):
+
+- **Visual defects** — overlapping/cut-off nodes, unreadable labels, broken
+  layout, missing edges, error chrome.
+- **New-user clarity** — is the screen legible and inviting to a newcomer?
+- **Accessibility** — contrast, text size, color-only signals, target size.
+- **Living-graph aliveness** — do glow/breathe/flow status cues read clearly?
+
+Report-only: a **FLAG** is the model's subjective concern, surfaced for a human
+to look at — it never fails the build. The suite *does* assert the model
+answered, so a broken client is still caught.
+
+## Large-scale perf sweep
+
+`tests/perf/scale-sweep.spec.ts` seeds real graphs (via the GraphQL API, the
+same path a human/AI uses) of increasing size, loads each at one or more
+quality tiers, and records the in-app `window.__graphPerf` readings plus load
+time, settle time and query latency.
+
+```bash
+# .env.test.local
+SCALE_SWEEP_SIZES=50,200,500,1000,2000     # blank => small in CI, large locally
+SCALE_SWEEP_QUALITIES=HIGH,ULTRA
+```
+
+Metrics per (size, quality):
+
+- **Reliable (measured directly from the browser, captured at every size):**
+  rendered node/edge counts, initial load ms, graph-scoped query p95, and
+  **interaction FPS** — real rendered frames/sec while a node is dragged
+  (counted via `requestAnimationFrame`, so it needs no app instrumentation and
+  reflects how janky the graph feels under interaction at scale).
+- **Best-effort bonus (from the app's `window.__graphPerf`, which only
+  publishes ~every 2s while the sim ticks):** settle ms (to `alpha ≤ 0.02`),
+  avg/p95 sim tick ms, layout drift (`rmsFromSavedPx`). These can be blank for
+  graphs that settle instantly — `interactionFps` is the headline signal.
+
+A FRESH graph is seeded per (size, quality) so each measurement starts from an
+unsettled layout (otherwise the second quality loads the first run's settled,
+pinned positions and the sim never ticks). Output:
+`test-artifacts/scale-sweep/index.html` — a table plus inline SVG charts of how
+each metric scales, with the `@perf` budgets drawn for reference.
+
+Report-only; the only hard assertion is that a seeded graph actually renders.
+Each seeded graph is deleted afterward (edges first, then nodes, then graph).
+
+## CI
+
+GitHub-hosted runners can't reach your local GPUs, so neither suite gates
+merges there. To gate on them, register a **self-hosted runner** on a machine
+that can reach the endpoints, give it the `.env.test.local`, and add a workflow
+job (manual-dispatch or nightly) that runs `npm run test:perf:scale` /
+`npm run test:vlm`. The scale sweep alone (no VLM) is safe to run on any runner
+with the dev stack and a small `SCALE_SWEEP_SIZES`.