ServiceNow · jlamypoirier · May 26, 2026 · May 27, 2026 · May 27, 2026 · May 27, 2026
diff --git a/examples/evaluate_precision/qwen.yaml b/examples/evaluate_precision/qwen.yaml
@@ -0,0 +1,25 @@
+# Precision-evaluation config on Qwen2.5-0.5B — the model used for the Fast-LLM vs DeepSpeed
+# precision-pattern comparison (DeepSpeed side: tools/evaluate_precision_deepspeed.py).
+#
+# Run with:
+#   python -m tools.evaluate_precision -c examples/evaluate_precision/qwen.yaml
+pretrained:
+  path: Qwen/Qwen2.5-0.5B
+  format: qwen2
+output_dir: /tmp/fast_llm_tests/evaluate_precision/qwen_features
+sequence_length: 2048
+variants:
+  # Maps to the DeepSpeed harness's `bf16_head_bf16` (compute bf16, lm head in compute dtype).
+  bf16:
+    model.distributed.compute_dtype: bfloat16
+  # Maps to the DeepSpeed harness's `bf16` (compute bf16, fp32 lm head — the stack default).
+  bf16_fp32_lm_head:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.head.fp32_lm_head: true
+  # Maps to the DeepSpeed harness's `fp16_head_fp16`.
+  fp16:
+    model.distributed.compute_dtype: float16
+  # Maps to the DeepSpeed harness's `fp16`.
+  fp16_fp32_lm_head:
+    model.distributed.compute_dtype: float16
+    model.base_model.head.fp32_lm_head: true
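Note on the variant format: each variant above is a flat mapping from a dotted path to a value. The sketch below shows one way such overrides could be expanded onto a nested config. It is an illustration only: `apply_dotted_overrides` and `base_config` are hypothetical names, and how `tools.evaluate_precision` actually applies the overrides is not shown in this diff.

```python
from copy import deepcopy
from typing import Any


def apply_dotted_overrides(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `base` with every dotted key in `overrides` set on the nested dict."""
    result = deepcopy(base)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Hypothetical base config; the real one comes from the pretrained checkpoint.
base_config = {"model": {"distributed": {"compute_dtype": "float32"}}}

# Same keys as the `bf16_fp32_lm_head` variant.
variant = {
    "model.distributed.compute_dtype": "bfloat16",
    "model.base_model.head.fp32_lm_head": True,
}

expanded = apply_dotted_overrides(base_config, variant)
print(expanded)
```

Working on a deep copy keeps the base config untouched, so each variant starts from the same baseline.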
diff --git a/examples/evaluate_precision/sample_text.txt b/examples/evaluate_precision/sample_text.txt
@@ -0,0 +1,23 @@
+The history of computing is often told as a story of ever-smaller and ever-faster machines, but the more interesting thread is the slow accumulation of good abstractions. Early programmers spoke directly to the hardware, toggling switches and rewiring panels, and every problem had to be solved in the vocabulary of the machine in front of them. The arrival of assembly language, and then of compiled languages, did not make the computers any faster; it made the programmers faster, because it let them think in terms closer to the problem and further from the circuitry. Each new layer hid a mess of detail beneath a clean interface, and each clean interface freed the people above it to build something larger than the layer below could have imagined.
+
+Numerical computation followed the same pattern, though its abstractions were mathematical rather than mechanical. The first scientific programs tracked every digit by hand, and a single rounding decision could quietly ruin a long calculation. Floating point arithmetic was a hard-won compromise: it traded a little accuracy for an enormous gain in range and convenience, and it came with rules subtle enough that careful engineers spent entire careers studying them. The promise was never that the answers would be exact, only that the errors would be small and, more importantly, predictable. A method whose errors stay bounded and behave smoothly is far more useful than one that is occasionally perfect and occasionally catastrophic, because predictability is what lets you reason about a system you cannot fully observe.
+
+This distinction between bounded error and occasional disaster runs through the whole of engineering. A bridge is not designed to bear exactly the load it will encounter; it is designed with margins, so that the inevitable surprises fall inside a region the designer has already considered. Software that processes real data is no different. The inputs will be messier than the specification promised, the edge cases will arrive in combinations nobody enumerated, and the only durable defense is to build systems whose failure modes are gentle. A program that degrades gracefully under unexpected input is worth more than one that is flawless on the cases its author happened to imagine, because the world is under no obligation to supply only imaginable cases.
+
+Modern machine learning lives squarely inside this tradition, even when its practitioners do not describe it that way. Training a large model means multiplying enormous matrices billions of times, and the precision of each multiplication is a design choice rather than a fixed fact of nature. Lower precision means smaller numbers to move and faster hardware to move them, but it also means coarser rounding, and the central question is always whether that rounding stays in the harmless regime or crosses into the dangerous one. The answer depends on the model, the data, and the particular sequence of operations involved, which is exactly why it has to be measured rather than assumed. Intuition about numerical behavior is notoriously unreliable at scale, where quantities interact in ways that small examples never reveal.
+
+Consider what happens to a single number as it flows through a deep network. It begins as an input, is scaled and shifted and combined with thousands of its neighbors, passes through a nonlinearity, and emerges as part of the input to the next layer, where the whole process repeats. By the time it reaches the final layer it has been transformed dozens of times, and any error introduced early has had dozens of opportunities to grow or shrink. Sometimes these errors cancel, averaging out across many independent contributions; sometimes they reinforce, when the same systematic bias is applied at every step. The difference between these two fates is the difference between a model that trains stably and one that diverges for reasons its authors struggle to explain.
+
+The output layer deserves special attention, because it is where the model finally commits to a prediction. Up to that point the internal representations are abstract and somewhat forgiving; small perturbations shift them a little without changing their meaning. But the final projection turns those representations into concrete scores over a large vocabulary, and those scores are then exponentiated and normalized into probabilities. Exponentiation is unforgiving of additive error: a small shift in a score becomes a multiplicative change in a probability, and a small change in a probability can flip a decision. This is why the precision of the last step is often discussed out of proportion to its share of the total computation. It is not that the last matrix multiply is expensive; it is that it sits at the most sensitive point in the pipeline.
+
+Yet sensitivity at a single point does not automatically translate into importance for the whole. If the representation arriving at that point already carries substantial error from everything upstream, then cleaning up only the final step yields little, because the dominant error was introduced earlier and is simply passed through. The benefit of high precision at the output is largest exactly when the rest of the pipeline is already clean, and smallest when the upstream is noisy. This is a general principle of error analysis that beginners frequently miss: the value of fixing one stage depends entirely on whether that stage is the bottleneck, and the bottleneck is rarely where attention is first drawn.
+
+There is a further subtlety, which is that the magnitude of the quantities involved changes how much a fixed rounding error matters in relative terms. When a model is confident, the score it assigns to the chosen outcome is close to the maximum, the corresponding log probability is close to zero, and a small absolute error in that log probability is a large fraction of its tiny value. When a model is uncertain, spreading its belief across many outcomes, the same log probability is a large negative number, and the identical absolute error is a negligible fraction of it. The relative importance of a rounding step therefore depends not only on where it sits in the pipeline but on the regime the model is operating in, which is set by the data it happens to be processing at that moment.
+
+This is why measurements that look contradictory are often perfectly consistent once the regime is accounted for. A change that appears to make no difference on one dataset can make a clear difference on another, not because the underlying arithmetic changed, but because the quantities being rounded shifted from one regime to the other. An honest investigation reports both results and the condition that distinguishes them, rather than picking whichever supports a tidy story. The condition is the finding; the individual numbers are only evidence for it.
+
+Reinforcement learning from human feedback adds yet another layer to this picture, because it compares the behavior of two systems rather than examining one in isolation. A model generates text under one implementation and is then evaluated under another, and the learning signal depends on the ratio between the probabilities the two implementations assign to the same tokens. If the two implementations agree, the ratio is near one and the signal is clean; if they disagree systematically, the ratio carries a bias that no amount of careful optimization can remove, because it is baked into the comparison itself. The danger here is not random noise, which averages away over many samples, but systematic disagreement, which does not. Two correct-looking systems can still disagree in a way that quietly corrupts everything built on top of their comparison.
+
+The practical lesson is that matching matters more than absolute accuracy in this setting. It is better for two systems to be wrong in the same way than for one to be right and the other wrong, because a shared error cancels in the ratio while an unshared one does not. This inverts the usual intuition, which prizes accuracy above all. It explains why engineers sometimes deliberately make a fast system reproduce the quirks of a slow one, rather than improving it, and why a change that improves a system in isolation can hurt the larger pipeline it lives in if it breaks an agreement that other parts relied upon. Consistency is a feature, even when it is consistency in imperfection.
+
+All of this argues for a particular discipline: measure the thing you actually care about, under the conditions it will actually face, and report the conditions alongside the numbers. Good measurement, like a good abstraction, is what lets us trust the layers we cannot see. It does not eliminate uncertainty, but it bounds it, and a bounded uncertainty is something an engineer can build on. The goal is never to pretend the errors are gone. The goal is to know how large they are, where they come from, and whether they stay in the gentle regime or threaten to cross into the steep one where small causes produce large and unwelcome effects.
diff --git a/examples/evaluate_precision/smol.yaml b/examples/evaluate_precision/smol.yaml
@@ -0,0 +1,59 @@
+# Example precision-evaluation config: sweep precision-stability features on SmolLM2-135M.
+#
+# Run with:
+#   python -m tools.evaluate_precision -c examples/evaluate_precision/smol.yaml
+#
+# `pretrained.path` accepts either a local checkpoint directory or a HF Hub model id
+# (auto-downloaded via `huggingface_hub.snapshot_download` on first use).
+pretrained:
+  path: HuggingFaceTB/SmolLM2-135M
+  format: llama
+output_dir: /tmp/fast_llm_tests/evaluate_precision/features
+sequence_length: 2048
+variants:
+  # Baseline bf16: compute_dtype=bf16 + Fast-LLM defaults (fp32 gradient accumulation, bf16 residual, bf16 lm_head).
+  bf16:
+    model.distributed.compute_dtype: bfloat16
+  # Turn ON full-precision residual stream.
+  bf16_fp32_residual:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.embeddings.full_precision_residual: true
+  # Turn ON fp32 LM head matmul (PR #526).
+  bf16_fp32_lm_head:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.head.fp32_lm_head: true
+  # Both stability features on (most precise bf16-compute configuration).
+  bf16_max_precision:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.embeddings.full_precision_residual: true
+    model.base_model.head.fp32_lm_head: true
+  # Diagnostic: enable bf16 reduced-precision reductions in cuBLAS GEMMs. Tests whether the
+  # within-engine bf16-vs-fp32 gap is sensitive to the partial-sum reduction precision (the
+  # MMA accumulator is fp32 by hardware on H100/A100; this flag affects split-K reductions).
+  bf16_reduced_reduction:
+    model.distributed.compute_dtype: bfloat16
+    _torch_backend.cuda.matmul.allow_bf16_reduced_precision_reduction: true
+  # Diagnostic: simulate a "bf16 inputs, fp32 output" lm-head matmul kernel. fp32_lm_head=True
+  # upcasts inputs and weights to fp32; matmul_precision='medium' then runs the matmul through
+  # bf16 Tensor Cores anyway, while the logits stay fp32. Tests whether fp32_lm_head's gain
+  # comes from input precision or from skipping the bf16 output cast.
+  bf16_in_fp32_out:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.head.fp32_lm_head: true
+    _torch_matmul_precision: medium
+  # fp16 sweep: probes whether the rounding noise (RMS ~0.1 nats per token for bf16) shrinks
+  # ~8× under fp16 (10 mantissa bits vs 7, i.e. 2^3), as the literature's "switch to fp16"
+  # recommendation implies. The default dynamic grad-scaler (initial scale 2^16) is the same
+  # for every variant, so relative comparisons stay meaningful.
+  fp16:
+    model.distributed.compute_dtype: float16
+  fp16_fp32_residual:
+    model.distributed.compute_dtype: float16
+    model.base_model.embeddings.full_precision_residual: true
+  fp16_fp32_lm_head:
+    model.distributed.compute_dtype: float16
+    model.base_model.head.fp32_lm_head: true
+  fp16_max_precision:
+    model.distributed.compute_dtype: float16
+    model.base_model.embeddings.full_precision_residual: true
+    model.base_model.head.fp32_lm_head: true
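Note on the "~8×" figure in the fp16 comment above: it follows from the mantissa widths alone. A minimal sketch of the arithmetic (the helper name is hypothetical, and this covers only the relative rounding step; whether the measured per-token noise actually shrinks by this factor is what the sweep is meant to test):

```python
# Explicitly stored mantissa bits per format.
MANTISSA_BITS = {"bfloat16": 7, "float16": 10}


def machine_epsilon(mantissa_bits: int) -> float:
    """Gap between 1.0 and the next representable value: 2 ** -mantissa_bits."""
    return 2.0 ** -mantissa_bits


eps = {name: machine_epsilon(bits) for name, bits in MANTISSA_BITS.items()}
ratio = eps["bfloat16"] / eps["float16"]

print(eps)    # relative rounding step of each format
print(ratio)  # 2 ** (10 - 7)
```

The finer rounding step of fp16 comes at the cost of a narrower exponent range (5 exponent bits vs 8 for bf16), which is why the grad-scaler matters for these variants.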
diff --git a/examples/evaluate_precision/smol_gspo.yaml b/examples/evaluate_precision/smol_gspo.yaml
@@ -0,0 +1,52 @@
+# Example precision-evaluation config: sweep precision-stability features on SmolLM2-135M
+# with the GSPO policy-gradient loss (uses advantages and old log-probabilities).
+#
+# Run with:
+#   python -m tools.evaluate_precision -c examples/evaluate_precision/smol_gspo.yaml
+#
+# `pretrained.path` accepts either a local checkpoint directory or a HF Hub model id
+# (auto-downloaded via `huggingface_hub.snapshot_download` on first use).
+pretrained:
+  path: HuggingFaceTB/SmolLM2-135M
+  format: llama
+model:
+  base_model:
+    head:
+      losses:
+        gspo:
+          type: gspo
+output_dir: /tmp/fast_llm_tests/evaluate_precision/gspo
+data_path: /tmp/fast_llm_tests/evaluate_precision/gspo_data
+sequence_length: 2048
+variants:
+  bf16:
+    model.distributed.compute_dtype: bfloat16
+  bf16_fp32_residual:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.embeddings.full_precision_residual: true
+  bf16_fp32_lm_head:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.head.fp32_lm_head: true
+  bf16_max_precision:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.embeddings.full_precision_residual: true
+    model.base_model.head.fp32_lm_head: true
+  bf16_reduced_reduction:
+    model.distributed.compute_dtype: bfloat16
+    _torch_backend.cuda.matmul.allow_bf16_reduced_precision_reduction: true
+  bf16_in_fp32_out:
+    model.distributed.compute_dtype: bfloat16
+    model.base_model.head.fp32_lm_head: true
+    _torch_matmul_precision: medium
+  fp16:
+    model.distributed.compute_dtype: float16
+  fp16_fp32_residual:
+    model.distributed.compute_dtype: float16
+    model.base_model.embeddings.full_precision_residual: true
+  fp16_fp32_lm_head:
+    model.distributed.compute_dtype: float16
+    model.base_model.head.fp32_lm_head: true
+  fp16_max_precision:
+    model.distributed.compute_dtype: float16
+    model.base_model.embeddings.full_precision_residual: true
+    model.base_model.head.fp32_lm_head: true
diff --git a/fast_llm/data/document/config.py b/fast_llm/data/document/config.py
@@ -80,6 +80,12 @@ class LanguageModelBatchPreprocessingConfig(TokenPreprocessingConfig):
     use_preference_spans: bool = Field(default=False)
     use_grpo_data: bool = Field(default=False)
     return_label_counts: bool = Field(default=False)
+    output_hidden_states: list[str] = Field(
+        default_factory=list,
+        desc="Regex patterns to add to each model input's `output_hidden_states` set."
+        " Tensors whose `_debug` name matches a pattern are stored in `kwargs[hidden_states]`"
+        " and, when running under a `Run` context, also written to `tensor_logs`.",
+    )
 
     def _validate(self) -> None:
         super()._validate()
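
Note on the `output_hidden_states` field above: it takes regex patterns that select tensors by name. The sketch below illustrates that kind of selection. The pattern strings and tensor names are hypothetical, and the use of `fullmatch` is an assumption; this diff does not show how the model side tests names against the compiled patterns.

```python
import re

# Hypothetical patterns, as they might be listed under `output_hidden_states`.
patterns = {re.compile(p) for p in [r"head\..*", r"decoder\.\d+\.mlp"]}

# Hypothetical debug tensor names.
tensor_names = ["decoder.0.mlp", "decoder.0.attn", "head.logits"]


def is_requested(name: str, compiled: set[re.Pattern]) -> bool:
    # Assumption: a name is selected when any pattern matches it in full.
    return any(pattern.fullmatch(name) for pattern in compiled)


selected = [name for name in tensor_names if is_requested(name, patterns)]
print(selected)
```

Compiling once and sharing the pattern objects across inputs avoids recompiling the same regex per model input.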

diff --git a/fast_llm/data/document/language_model.py b/fast_llm/data/document/language_model.py
@@ -161,6 +161,13 @@ def get_model_inputs(self, config: LanguageModelBatchPreprocessingConfig) -> lis
 
         self._set_target_inputs(model_inputs, config)
 
+        if config.output_hidden_states:
+            import re
+
+            patterns = {re.compile(pattern) for pattern in config.output_hidden_states}
+            for model_input in model_inputs:
+                model_input.output_hidden_states.update(patterns)
+
         return model_inputs
 
     def _set_target_inputs(