35 commits
02fd39c
Add tool to evaluate layer-wise numerical-error propagation
jlamypoirier May 26, 2026
4dd6c14
Collapse to a single config; require a checkpoint
jlamypoirier May 27, 2026
5ebea33
Expose `model:` alongside `pretrained:` in the tool config
jlamypoirier May 27, 2026
4c444d8
Inherit PretrainedGPTModelConfig; use Config update mechanism
jlamypoirier May 27, 2026
35206a6
Expand HF metadata allowlist for newer transformers configs
jlamypoirier May 27, 2026
bde1efa
Reshape console table for readability
jlamypoirier May 27, 2026
8099b51
Merge tensor+kind, fix decimal precision in console table
jlamypoirier May 27, 2026
dbd7702
Switch back to fixed-decimal formatting in the table
jlamypoirier May 27, 2026
152ffc3
Wipe per-variant experiment dir before each run
jlamypoirier May 27, 2026
7e98500
Support pre-generated memmap dataset; misc table-format polish
jlamypoirier May 28, 2026
173ae0d
Print per-variant summary at the end of the run
jlamypoirier May 28, 2026
005fd62
Reshape end-of-run summary: variants × aggregations, relative only
jlamypoirier May 28, 2026
c594658
Clarify intermediate aggregation in summary header
jlamypoirier May 28, 2026
3159f73
Split summary across fw/bw rows; one extra precision digit
jlamypoirier May 28, 2026
6ef153e
Two-row column header in summary; chronological column order
jlamypoirier May 28, 2026
7327932
Add fp32_lm_head flag for vLLM precision parity
jlamypoirier May 28, 2026
76335df
Extract layer-name labels for summary first/last columns
jlamypoirier May 28, 2026
8122946
Add `debug_hidden_states_log` to capture named tensors via output_hid…
jlamypoirier May 28, 2026
4633bfd
Capture logit gradients; expose them in the summary
jlamypoirier May 28, 2026
9ca1711
Place logits after head in bw summary; widen format for sub-percent v…
jlamypoirier May 28, 2026
f2655f3
Pick per-column decimals to guarantee ≥2 sig figs
jlamypoirier May 28, 2026
7f8ef96
Tighten summary table spacing
jlamypoirier May 28, 2026
08b1637
Support HF Hub model ids in pretrained.path
jlamypoirier May 28, 2026
77eae22
Add example precision-evaluation configs
jlamypoirier May 28, 2026
efa95b1
Drop bf16_no_fp32_gradients variant from example configs
jlamypoirier May 28, 2026
46bc5b8
Add weight gradients to per-variant report tables
jlamypoirier May 28, 2026
bef2f0d
Separate fw/bw/grad rows in per-variant tables
jlamypoirier May 28, 2026
4fecad4
Split summary into three tables (fw, bw, grad)
jlamypoirier May 28, 2026
4f47dc0
Split grad summary by parameter category
jlamypoirier May 28, 2026
5198c25
Per-tensor sample-density overrides in TensorLogsConfig
jlamypoirier May 28, 2026
312343e
Chosen-logprob loss, per-variant grad-scale auto-calibration, fp16 va…
jlamypoirier May 29, 2026
497c76c
Lean fixed-input runner + DeepSpeed-side precision comparison
jlamypoirier Jun 1, 2026
cecf7ae
Support random-init (model_weights=False) in both precision tools
jlamypoirier Jun 1, 2026
a6d9314
vLLM within-engine precision tool
jlamypoirier Jun 1, 2026
26cd8ab
Auto-bind vLLM fp32 head on tied-embedding models
jlamypoirier Jun 1, 2026
25 changes: 25 additions & 0 deletions examples/evaluate_precision/qwen.yaml
@@ -0,0 +1,25 @@
# Precision-evaluation config on Qwen2.5-0.5B — the model used for the Fast-LLM vs DeepSpeed
# precision-pattern comparison (DeepSpeed side: tools/evaluate_precision_deepspeed.py).
#
# Run with:
# python -m tools.evaluate_precision -c examples/evaluate_precision/qwen.yaml
pretrained:
  path: Qwen/Qwen2.5-0.5B
  format: qwen2
output_dir: /tmp/fast_llm_tests/evaluate_precision/qwen_features
sequence_length: 2048
variants:
  # Maps to the DeepSpeed harness's `bf16_head_bf16` (compute bf16, lm head in compute dtype).
  bf16:
    model.distributed.compute_dtype: bfloat16
  # Maps to the DeepSpeed harness's `bf16` (compute bf16, fp32 lm head — the stack default).
  bf16_fp32_lm_head:
    model.distributed.compute_dtype: bfloat16
    model.base_model.head.fp32_lm_head: true
  # Maps to the DeepSpeed harness's `fp16_head_fp16`.
  fp16:
    model.distributed.compute_dtype: float16
  # Maps to the DeepSpeed harness's `fp16`.
  fp16_fp32_lm_head:
    model.distributed.compute_dtype: float16
    model.base_model.head.fp32_lm_head: true
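Each entry under `variants` is a flat mapping from a dotted config path to the value that variant overrides. As a rough illustration of how such overrides can be expanded onto a nested config, here is a minimal sketch. `apply_overrides` and the plain-dict config are hypothetical; this is not Fast-LLM's Config update mechanism, and skipping underscore-prefixed keys is an assumption made only for this example.

```python
import copy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with dotted-path overrides applied (hypothetical helper)."""
    result = copy.deepcopy(base)
    for dotted_key, value in overrides.items():
        if dotted_key.startswith("_"):
            # Assumption: underscore-prefixed keys are not model-config paths, so skip them.
            continue
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


base_config = {"model": {"distributed": {"compute_dtype": "float32"}}}
variant = {
    "model.distributed.compute_dtype": "bfloat16",
    "model.base_model.head.fp32_lm_head": True,
}
variant_config = apply_overrides(base_config, variant)
print(variant_config)
```

Working on a deep copy keeps the base config untouched, so every variant starts from the same reference state.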
23 changes: 23 additions & 0 deletions examples/evaluate_precision/sample_text.txt
@@ -0,0 +1,23 @@
The history of computing is often told as a story of ever-smaller and ever-faster machines, but the more interesting thread is the slow accumulation of good abstractions. Early programmers spoke directly to the hardware, toggling switches and rewiring panels, and every problem had to be solved in the vocabulary of the machine in front of them. The arrival of assembly language, and then of compiled languages, did not make the computers any faster; it made the programmers faster, because it let them think in terms closer to the problem and further from the circuitry. Each new layer hid a mess of detail beneath a clean interface, and each clean interface freed the people above it to build something larger than the layer below could have imagined.

Numerical computation followed the same pattern, though its abstractions were mathematical rather than mechanical. The first scientific programs tracked every digit by hand, and a single rounding decision could quietly ruin a long calculation. Floating point arithmetic was a hard-won compromise: it traded a little accuracy for an enormous gain in range and convenience, and it came with rules subtle enough that careful engineers spent entire careers studying them. The promise was never that the answers would be exact, only that the errors would be small and, more importantly, predictable. A method whose errors stay bounded and behave smoothly is far more useful than one that is occasionally perfect and occasionally catastrophic, because predictability is what lets you reason about a system you cannot fully observe.

This distinction between bounded error and occasional disaster runs through the whole of engineering. A bridge is not designed to bear exactly the load it will encounter; it is designed with margins, so that the inevitable surprises fall inside a region the designer has already considered. Software that processes real data is no different. The inputs will be messier than the specification promised, the edge cases will arrive in combinations nobody enumerated, and the only durable defense is to build systems whose failure modes are gentle. A program that degrades gracefully under unexpected input is worth more than one that is flawless on the cases its author happened to imagine, because the world is under no obligation to supply only imaginable cases.

Modern machine learning lives squarely inside this tradition, even when its practitioners do not describe it that way. Training a large model means multiplying enormous matrices billions of times, and the precision of each multiplication is a design choice rather than a fixed fact of nature. Lower precision means smaller numbers to move and faster hardware to move them, but it also means coarser rounding, and the central question is always whether that rounding stays in the harmless regime or crosses into the dangerous one. The answer depends on the model, the data, and the particular sequence of operations involved, which is exactly why it has to be measured rather than assumed. Intuition about numerical behavior is notoriously unreliable at scale, where quantities interact in ways that small examples never reveal.

Consider what happens to a single number as it flows through a deep network. It begins as an input, is scaled and shifted and combined with thousands of its neighbors, passes through a nonlinearity, and emerges as part of the input to the next layer, where the whole process repeats. By the time it reaches the final layer it has been transformed dozens of times, and any error introduced early has had dozens of opportunities to grow or shrink. Sometimes these errors cancel, averaging out across many independent contributions; sometimes they reinforce, when the same systematic bias is applied at every step. The difference between these two fates is the difference between a model that trains stably and one that diverges for reasons its authors struggle to explain.

The output layer deserves special attention, because it is where the model finally commits to a prediction. Up to that point the internal representations are abstract and somewhat forgiving; small perturbations shift them a little without changing their meaning. But the final projection turns those representations into concrete scores over a large vocabulary, and those scores are then exponentiated and normalized into probabilities. Exponentiation is unforgiving of additive error: a small shift in a score becomes a multiplicative change in a probability, and a small change in a probability can flip a decision. This is why the precision of the last step is often discussed out of proportion to its share of the total computation. It is not that the last matrix multiply is expensive; it is that it sits at the most sensitive point in the pipeline.

Yet sensitivity at a single point does not automatically translate into importance for the whole. If the representation arriving at that point already carries substantial error from everything upstream, then cleaning up only the final step yields little, because the dominant error was introduced earlier and is simply passed through. The benefit of high precision at the output is largest exactly when the rest of the pipeline is already clean, and smallest when the upstream is noisy. This is a general principle of error analysis that beginners frequently miss: the value of fixing one stage depends entirely on whether that stage is the bottleneck, and the bottleneck is rarely where attention is first drawn.

There is a further subtlety, which is that the magnitude of the quantities involved changes how much a fixed rounding error matters in relative terms. When a model is confident, the score it assigns to the chosen outcome is close to the maximum, the corresponding log probability is close to zero, and a small absolute error in that log probability is a large fraction of its tiny value. When a model is uncertain, spreading its belief across many outcomes, the same log probability is a large negative number, and the identical absolute error is a negligible fraction of it. The relative importance of a rounding step therefore depends not only on where it sits in the pipeline but on the regime the model is operating in, which is set by the data it happens to be processing at that moment.

This is why measurements that look contradictory are often perfectly consistent once the regime is accounted for. A change that appears to make no difference on one dataset can make a clear difference on another, not because the underlying arithmetic changed, but because the quantities being rounded shifted from one regime to the other. An honest investigation reports both results and the condition that distinguishes them, rather than picking whichever supports a tidy story. The condition is the finding; the individual numbers are only evidence for it.

Reinforcement learning from human feedback adds yet another layer to this picture, because it compares the behavior of two systems rather than examining one in isolation. A model generates text under one implementation and is then evaluated under another, and the learning signal depends on the ratio between the probabilities the two implementations assign to the same tokens. If the two implementations agree, the ratio is near one and the signal is clean; if they disagree systematically, the ratio carries a bias that no amount of careful optimization can remove, because it is baked into the comparison itself. The danger here is not random noise, which averages away over many samples, but systematic disagreement, which does not. Two correct-looking systems can still disagree in a way that quietly corrupts everything built on top of their comparison.

The practical lesson is that matching matters more than absolute accuracy in this setting. It is better for two systems to be wrong in the same way than for one to be right and the other wrong, because a shared error cancels in the ratio while an unshared one does not. This inverts the usual intuition, which prizes accuracy above all. It explains why engineers sometimes deliberately make a fast system reproduce the quirks of a slow one, rather than improving it, and why a change that improves a system in isolation can hurt the larger pipeline it lives in if it breaks an agreement that other parts relied upon. Consistency is a feature, even when it is consistency in imperfection.

All of this argues for a particular discipline: measure the thing you actually care about, under the conditions it will actually face, and report the conditions alongside the numbers. Good measurement, like a good abstraction, is what lets us trust the layers we cannot see. It does not eliminate uncertainty, but it bounds it, and a bounded uncertainty is something an engineer can build on. The goal is never to pretend the errors are gone. The goal is to know how large they are, where they come from, and whether they stay in the gentle regime or threaten to cross into the steep one where small causes produce large and unwelcome effects.
59 changes: 59 additions & 0 deletions examples/evaluate_precision/smol.yaml
@@ -0,0 +1,59 @@
# Example precision-evaluation config: sweep precision-stability features on SmolLM2-135M.
#
# Run with:
# python -m tools.evaluate_precision -c examples/evaluate_precision/smol.yaml
#
# `pretrained.path` accepts either a local checkpoint directory or a HF Hub model id
# (auto-downloaded via `huggingface_hub.snapshot_download` on first use).
pretrained:
  path: HuggingFaceTB/SmolLM2-135M
  format: llama
output_dir: /tmp/fast_llm_tests/evaluate_precision/features
sequence_length: 2048
variants:
  # Baseline bf16: compute_dtype=bf16 + Fast-LLM defaults (fp32 gradient accumulation, bf16 residual, bf16 lm_head).
  bf16:
    model.distributed.compute_dtype: bfloat16
  # Turn ON full-precision residual stream.
  bf16_fp32_residual:
    model.distributed.compute_dtype: bfloat16
    model.base_model.embeddings.full_precision_residual: true
  # Turn ON fp32 LM head matmul (PR #526).
  bf16_fp32_lm_head:
    model.distributed.compute_dtype: bfloat16
    model.base_model.head.fp32_lm_head: true
  # Both stability features on (most precise bf16-compute configuration).
  bf16_max_precision:
    model.distributed.compute_dtype: bfloat16
    model.base_model.embeddings.full_precision_residual: true
    model.base_model.head.fp32_lm_head: true
  # Diagnostic: enable bf16 reduced-precision reductions in cuBLAS GEMMs. Tests whether the
  # within-engine bf16-vs-fp32 gap is sensitive to the partial-sum reduction precision (the
  # MMA accumulator is fp32 by hardware on H100/A100; this flag affects split-K reductions).
  bf16_reduced_reduction:
    model.distributed.compute_dtype: bfloat16
    _torch_backend.cuda.matmul.allow_bf16_reduced_precision_reduction: true
  # Diagnostic: simulate a "bf16 inputs, fp32 output" lm-head matmul kernel. fp32_lm_head=True
  # upcasts inputs+weights to fp32, then matmul_precision='medium' runs the matmul through
  # bf16 Tensor Cores anyway, then logits stay fp32. Tests whether fp32_lm_head's gain comes
  # from input precision or from skipping the bf16 output cast.
  bf16_in_fp32_out:
    model.distributed.compute_dtype: bfloat16
    model.base_model.head.fp32_lm_head: true
    _torch_matmul_precision: medium
  # fp16 sweep: probes whether the precision-vs-noise picture (rms noise ~0.1 nats per token
  # for bf16) shrinks ~8× for fp16 (10 mantissa bits vs 7), as the literature's "switch to
  # fp16" recommendation implies. Default dynamic grad-scaler (initial 2^16) is uniform
  # across variants, so relative comparisons stay meaningful.
  fp16:
    model.distributed.compute_dtype: float16
  fp16_fp32_residual:
    model.distributed.compute_dtype: float16
    model.base_model.embeddings.full_precision_residual: true
  fp16_fp32_lm_head:
    model.distributed.compute_dtype: float16
    model.base_model.head.fp32_lm_head: true
  fp16_max_precision:
    model.distributed.compute_dtype: float16
    model.base_model.embeddings.full_precision_residual: true
    model.base_model.head.fp32_lm_head: true
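The "~8×" figure in the fp16 comment follows from the mantissa widths: 10 bits versus 7 gives a rounding step that is 2^3 = 8 times finer. The sketch below checks that factor for a single rounding step using only the standard library. It emulates bf16 by rounding the float32 bit pattern, which is an illustration rather than what any GPU kernel does, and it says nothing about how rounding errors compound through a full model (that is what the tool measures).

```python
import math
import random
import struct


def round_to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 (ties to even) via the float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped 16 bits
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value (10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rms_relative_error(rounder, values) -> float:
    """Root-mean-square relative error introduced by one rounding step."""
    return math.sqrt(sum(((rounder(v) - v) / v) ** 2 for v in values) / len(values))


rng = random.Random(0)
# Values in [1, 2) share one exponent, so the comparison isolates mantissa width.
values = [rng.uniform(1.0, 2.0) for _ in range(20000)]
bf16_error = rms_relative_error(round_to_bf16, values)
fp16_error = rms_relative_error(round_to_fp16, values)
ratio = bf16_error / fp16_error
print(f"bf16 rms: {bf16_error:.2e}, fp16 rms: {fp16_error:.2e}, ratio: {ratio:.2f}")
```

The values are kept well inside fp16's range on purpose; fp16's narrower exponent (overflow and underflow) is a separate concern that the grad-scaler addresses.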
52 changes: 52 additions & 0 deletions examples/evaluate_precision/smol_gspo.yaml
@@ -0,0 +1,52 @@
# Example precision-evaluation config: sweep precision-stability features on SmolLM2-135M
# with the GSPO policy-gradient loss (uses advantages and old log-probabilities).
#
# Run with:
# python -m tools.evaluate_precision -c examples/evaluate_precision/smol_gspo.yaml
#
# `pretrained.path` accepts either a local checkpoint directory or a HF Hub model id
# (auto-downloaded via `huggingface_hub.snapshot_download` on first use).
pretrained:
  path: HuggingFaceTB/SmolLM2-135M
  format: llama
model:
  base_model:
    head:
      losses:
        gspo:
          type: gspo
output_dir: /tmp/fast_llm_tests/evaluate_precision/gspo
data_path: /tmp/fast_llm_tests/evaluate_precision/gspo_data
sequence_length: 2048
variants:
  bf16:
    model.distributed.compute_dtype: bfloat16
  bf16_fp32_residual:
    model.distributed.compute_dtype: bfloat16
    model.base_model.embeddings.full_precision_residual: true
  bf16_fp32_lm_head:
    model.distributed.compute_dtype: bfloat16
    model.base_model.head.fp32_lm_head: true
  bf16_max_precision:
    model.distributed.compute_dtype: bfloat16
    model.base_model.embeddings.full_precision_residual: true
    model.base_model.head.fp32_lm_head: true
  bf16_reduced_reduction:
    model.distributed.compute_dtype: bfloat16
    _torch_backend.cuda.matmul.allow_bf16_reduced_precision_reduction: true
  bf16_in_fp32_out:
    model.distributed.compute_dtype: bfloat16
    model.base_model.head.fp32_lm_head: true
    _torch_matmul_precision: medium
  fp16:
    model.distributed.compute_dtype: float16
  fp16_fp32_residual:
    model.distributed.compute_dtype: float16
    model.base_model.embeddings.full_precision_residual: true
  fp16_fp32_lm_head:
    model.distributed.compute_dtype: float16
    model.base_model.head.fp32_lm_head: true
  fp16_max_precision:
    model.distributed.compute_dtype: float16
    model.base_model.embeddings.full_precision_residual: true
    model.base_model.head.fp32_lm_head: true
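For context on why this config needs advantages and old log-probabilities: GSPO, as it is commonly described, weights each sampled response by a length-normalized, sequence-level importance ratio and clips that ratio PPO-style. The sketch below is a plain-Python illustration of that idea for a single response. It is not taken from Fast-LLM's loss implementation; the function name and the `epsilon` value are hypothetical choices made only for this example.

```python
import math


def gspo_sequence_loss(new_logprobs, old_logprobs, advantage, epsilon=0.2):
    """Clipped sequence-level surrogate loss for one response (illustrative only).

    Returns (loss, ratio). `epsilon` is an arbitrary clipping range for this sketch.
    """
    # Length-normalized log ratio: mean over tokens of log pi_new - log pi_old.
    mean_log_ratio = sum(n - o for n, o in zip(new_logprobs, old_logprobs)) / len(new_logprobs)
    ratio = math.exp(mean_log_ratio)
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Pessimistic (min) objective, negated so that lower is better.
    loss = -min(ratio * advantage, clipped_ratio * advantage)
    return loss, ratio


old = [-1.0, -2.0, -0.5]
# Identical log-probabilities: the ratio is exactly 1 and the loss is -advantage.
loss_same, ratio_same = gspo_sequence_loss(old, old, advantage=1.5)
# A systematic +0.01 nat shift on every token biases the ratio by about 1%.
shifted = [value + 0.01 for value in old]
loss_shifted, ratio_shifted = gspo_sequence_loss(shifted, old, advantage=1.5)
print(ratio_same, ratio_shifted)
```

Because the ratio depends on the difference between two sets of log-probabilities, a systematic precision offset on either side feeds directly into the loss, which is what makes this loss a useful probe for the sweep.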
6 changes: 6 additions & 0 deletions fast_llm/data/document/config.py
@@ -80,6 +80,12 @@ class LanguageModelBatchPreprocessingConfig(TokenPreprocessingConfig):
    use_preference_spans: bool = Field(default=False)
    use_grpo_data: bool = Field(default=False)
    return_label_counts: bool = Field(default=False)
    output_hidden_states: list[str] = Field(
        default_factory=list,
        desc="Regex patterns to add to each model input's `output_hidden_states` set."
        " Matching `_debug`-named tensors get populated into `kwargs[hidden_states]`"
        " and (when running under a `Run` context) emitted into `tensor_logs`.",
    )

    def _validate(self) -> None:
        super()._validate()
7 changes: 7 additions & 0 deletions fast_llm/data/document/language_model.py
@@ -161,6 +161,13 @@ def get_model_inputs(self, config: LanguageModelBatchPreprocessingConfig) -> lis

        self._set_target_inputs(model_inputs, config)

        if config.output_hidden_states:
            import re

            patterns = {re.compile(pattern) for pattern in config.output_hidden_states}
            for model_input in model_inputs:
                model_input.output_hidden_states.update(patterns)

        return model_inputs

    def _set_target_inputs(
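The hunk above compiles each configured string into a regex and adds the compiled patterns to every model input's `output_hidden_states` set. As a rough illustration of how such a set can select tensors by name, here is a minimal standalone sketch. The tensor names are hypothetical, and matching with `Pattern.match` (anchored at the start of the name) is an assumption; the diff does not show how the patterns are consumed.

```python
import re

# Hypothetical debug tensor names; real names depend on the model's layers.
tensor_names = [
    "embeddings.output",
    "decoder.0.mixer.output",
    "decoder.1.mlp.output",
    "head.logits",
]
# The duplicate entry is deliberate: the set below collapses it.
config_patterns = [r"decoder\.\d+\.mlp\..*", r"head\.logits", r"head\.logits"]

# Compiled patterns are hashable, so a set deduplicates repeated entries.
patterns = {re.compile(pattern) for pattern in config_patterns}

# Assumption: a tensor is captured when any pattern matches from the start of its name.
captured = [name for name in tensor_names if any(p.match(name) for p in patterns)]
print(captured)
```

Note that `match` anchors only at the start, so a pattern such as `head` would also select `head.logits`; a full-name match would need `fullmatch` or an explicit `$`.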