diff --git a/problems/linalg/qr_v2/README.md b/problems/linalg/qr_v2/README.md
new file mode 100644
index 000000000..bd7ecf0d1
--- /dev/null
+++ b/problems/linalg/qr_v2/README.md
@@ -0,0 +1,68 @@
+# QR v2 conditioning hardening
+
+QR v2 intentionally allows implementations to choose different internal
+precision strategies. The correctness contract is about the returned FP32
+compact Householder factors, not about forbidding FP16, FP8, NVFP4, or
+shape-specific dispatch inside the kernel.
+
+The hardening target is narrower: do not let the benchmark reward a solution
+that picks precision only from the public shape and ignores the numerical
+quality of the actual matrices in that shape.
+
+## Pattern
+
+Use the same public shape with several matrix distributions:
+
+- `dense` for the common well-conditioned path.
+- `rankdef`, `nearrank`, `clustered`, `band`, `rowscale`, and
+  `nearcollinear` for numerical stress.
+- `mixed` for heterogeneous batches where each batch position independently
+  receives a randomly chosen profile.
+
+This means a `batch x 512 x 512` benchmark can contain easy and hard matrices in
+the same call. A submission can still specialize for `n = 512`, but it has to
+either use a robust path for the whole batch or inspect matrix quality before
+routing individual matrices. Pure shape-only low-precision routing is much less
+likely to pass, and pure shape-only high-precision routing pays the accurate
+path's ranked runtime cost even on easy matrices.
+
+## Smoke checks
+
+The `tests:` list is useful for local iteration, but it should be treated as a
+smoke path. Contestants can run it before submitting, and some workflows may use
+it as a fast correctness check, but it is not enough to enforce the competition
+contract by itself.
+
+Do not rely on `tests:` to prevent shape-only precision routing. Any behavior
+that must affect eligibility or rank needs to appear in `benchmarks:`, because
+leaderboard mode validates benchmark outputs.
+
+## Ranked checks
+
+The `benchmarks:` list includes both:
+
+- production-like dense shapes, so fast low-precision paths still matter when
+  they are numerically valid; and
+- mixed or homogeneous stress shapes, so robustness is part of the score rather
+  than only a pass/fail gate.
+
+QR v2 includes ranked `512 x 512` stress benchmarks for `mixed`, `rankdef`,
+`clustered`, `rowscale`, and `nearcollinear` distributions. Those cases are the
+main guard against routing precision purely from the public shape.
+
+The evaluator combines each public seed with `POPCORN_SEED` when the service
+sets it. That keeps runs reproducible while preventing exact input matrices from
+being fully determined by the public `task.yml`.
+
+## Extending the suite
+
+When adding a new QR shape or distribution, prefer adding it as a pair:
+
+1. A representative `benchmarks:` case when the behavior should affect
+   eligibility or ranking.
+2. An optional cheap `tests:` smoke case if it helps local iteration.
+
+For compute-heavy stress distributions, prefer one mixed benchmark before adding
+many homogeneous stress cases. If a homogeneous case is important enough to
+change the competition outcome, put it in `benchmarks:` first and only then add
+a smaller `tests:` proxy for convenience.
diff --git a/problems/linalg/qr_v2/task.yml b/problems/linalg/qr_v2/task.yml
index 0da22d88a..f065ad48e 100644
--- a/problems/linalg/qr_v2/task.yml
+++ b/problems/linalg/qr_v2/task.yml
@@ -51,6 +51,9 @@ description: |
   well-conditioned, and route it to a path that is only valid for well-conditioned
   inputs, and the runtime cost of the accurate path on hard inputs is part of the
   score. Each matrix must be factored correctly on its own merits.
+  Local `test` mode is a smoke path for contestants, not the only competition
+  gate; checks against shape-only numerical routing belong in `benchmarks`,
+  because leaderboard mode validates benchmark outputs and those cases set rank.
 
   Correctness is a hard gate against the original FP32 input and the FP32
   `torch.geqrf` compact-factor contract. Low-bit FP16, FP8, or NVFP4 work is
@@ -119,3 +122,5 @@ benchmarks:
   - {"batch": 640, "n": 512, "cond": 0, "seed": 770003, "case": "rankdef"}
   - {"batch": 640, "n": 512, "cond": 0, "seed": 770004, "case": "clustered"}
   - {"batch": 60, "n": 1024, "cond": 0, "seed": 770005, "case": "nearrank"}
+  - {"batch": 640, "n": 512, "cond": 0, "seed": 770006, "case": "rowscale"}
+  - {"batch": 640, "n": 512, "cond": 0, "seed": 770007, "case": "nearcollinear"}