[NNX] NNX migration prep (9/N): NNX-aware QK-Clip + checkpoint utilities #3836
Draft
ecnal-cienet wants to merge 8 commits into main
- Add TrainStateNNX (layers/train_state_nnx.py) with checkpoint and unit tests
- Refactor model_creation_utils with create_nnx_abstract_model(); add NNX support to muon_utils
- Add get_abstract_state_nnx() and get_nnx_named_sharding_with_scan_axis() to maxtext_utils.py
- Wire NNX train state into train.py and train_utils.py with pure_nnx dispatch
Part 1 (sharding diagnostics):
- maxtext_utils.py: extend print_shardings_params to support NNX (nnx.State input)
- run_sharding_dump.py: add --pure_nnx flag
Part 2 (post-training bugfixes, NNX side):
- models.py: unpack MultimodalInput before passing to NNXDecoder (was passing the whole object as multimodal_input= kwarg; NNXDecoder only accepts the individual image/audio/mask fields)
- optimizers.py: guard adam_pax against scalar LR from optax.inject_hyperparams (callable() check before invoking learning_rate_fn)
- train_distill.py / train_sft.py / train_rl.py: avoid nesting nnx.value_and_grad inside nnx.jit (Tunix's default trainer), which raises "graph structure of a node added to cached_partial was mutated"; refactor to jax.value_and_grad with explicit nnx.split / nnx.merge. train_rl.py also adds with_sharding_constraint + dtype-cast compat shims for jax 0.9 / tpu_inference.
Linen<->NNX checkpoint conversion utility and validation tool moved to a follow-up PR (PR4.5) to keep this change reviewable.
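The optimizers.py guard listed above can be pictured with a few lines of plain Python. This is a minimal sketch, not the code in optimizers.py: `resolve_learning_rate` and `linear_warmup` are hypothetical names, and the only behavior taken from the commit message is the `callable()` check before invoking the learning-rate function.

```python
def resolve_learning_rate(learning_rate, step):
    """Return a scalar learning rate for `step`.

    `learning_rate` may be a schedule (a callable taking the step) or an
    already-resolved scalar, such as a value injected as a hyperparameter.
    """
    if callable(learning_rate):
        return learning_rate(step)
    # Already a scalar: calling it would raise TypeError, so pass it through.
    return learning_rate


def linear_warmup(step, peak=1e-3, warmup_steps=100):
    """Hypothetical schedule, used only to exercise the callable branch."""
    return peak * min(1.0, step / warmup_steps)
```

Both `resolve_learning_rate(0.01, step)` and `resolve_learning_rate(linear_warmup, step)` return a float, so the caller does not need to know which form it was handed.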
Bidirectional Linen <-> NNX checkpoint conversion. Same on-disk shape
both directions; round-trips preserve byte values.
Top-level key mapping:
- Linen params/params/<model> <-> NNX model/<model> (double-nesting,
{value:} wrappers).
- Linen opt_state <-> NNX optimizer/opt_state (params level on mu/nu).
- Linen step <-> NNX optimizer/step.
Layer structure:
- scan_layers=True (default): stack layers_N -> layers tensor.
- scan_layers=False: rename layers_N -> integer-keyed layers/{N}.
NNX->Linen direction auto-detects which layer layout the source uses.
--direction=auto picks Linen vs NNX from top-level keys.
Pure utility addition. No production-code dependencies; PR5+ do not
depend on this branch. Comparison utility split into PR4.6.
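The key mapping above can be sketched on plain nested dicts. This is an illustration under stated assumptions, not the converter itself: the function names are hypothetical, leaves are wrapped as `{"value": leaf}` only under `model`, `opt_state` is passed through untouched (the real mapping also adjusts the `params` level on mu/nu), and only the scan_layers=False renaming is shown.

```python
import re

_LAYER_KEY = re.compile(r"^layers_(\d+)$")


def _wrap(tree):
    """Wrap every leaf as {"value": leaf} (assumed NNX leaf shape)."""
    if isinstance(tree, dict):
        return {k: _wrap(v) for k, v in tree.items()}
    return {"value": tree}


def _unwrap(tree):
    """Inverse of _wrap."""
    if isinstance(tree, dict):
        if set(tree) == {"value"}:
            return tree["value"]
        return {k: _unwrap(v) for k, v in tree.items()}
    return tree


def _rename_layers(tree):
    """layers_N -> layers[N] with integer keys (scan_layers=False layout)."""
    if not isinstance(tree, dict):
        return tree
    out, layers = {}, {}
    for key, sub in tree.items():
        match = _LAYER_KEY.match(key) if isinstance(key, str) else None
        if match:
            layers[int(match.group(1))] = _rename_layers(sub)
        else:
            out[key] = _rename_layers(sub)
    if layers:
        out["layers"] = layers
    return out


def _restore_layers(tree):
    """Inverse of _rename_layers."""
    if not isinstance(tree, dict):
        return tree
    out = {}
    for key, sub in tree.items():
        if key == "layers" and isinstance(sub, dict) and all(isinstance(k, int) for k in sub):
            for index, layer in sub.items():
                out[f"layers_{index}"] = _restore_layers(layer)
        else:
            out[key] = _restore_layers(sub)
    return out


def detect_direction(ckpt):
    """Guess the source format from top-level keys, as --direction=auto does."""
    if "params" in ckpt:
        return "linen"
    if "model" in ckpt:
        return "nnx"
    raise ValueError("unrecognized checkpoint layout")


def linen_to_nnx(ckpt):
    return {
        "model": _wrap(_rename_layers(ckpt["params"]["params"])),
        "optimizer": {"opt_state": ckpt["opt_state"], "step": ckpt["step"]},
    }


def nnx_to_linen(ckpt):
    return {
        "params": {"params": _restore_layers(_unwrap(ckpt["model"]))},
        "opt_state": ckpt["optimizer"]["opt_state"],
        "step": ckpt["optimizer"]["step"],
    }
```

Because each helper has an exact inverse, `nnx_to_linen(linen_to_nnx(ckpt))` returns a tree equal to `ckpt`, which is the round-trip property the commit message claims for the real tool.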
Bug fixes (run as no-op while pure_nnx=False stays default):
- nnx_wrappers.py: add _refresh_variable_trace_state + is_linen_initializing;
call from ToLinen after nnx.update to fix "Cannot extract graph node from
different trace level" when grad tracers leak into Variable._trace_state.
- gpt_oss.py / olmo3.py: replace inline nn.Dropout(...) with self.dropout =
linears.Dropout(...) in __init__ to fix CallCompactUnboundModuleError.
- normalizations.py: Qwen3NextRMSNorm signature: eps -> epsilon, accept
shard_mode/kernel_axes/parameter_memory_host_offload for callsite parity.
- attentions.py / qwen3.py: callsites eps= -> epsilon=.
- moe.py: per_expert_scale block moved into the unfused-kernel else branch
(was scaling wo even when fused_kernel was active).
- models.py: build MTP block as MultiTokenPredictionBlock(...) directly
(drop the ToNNX(linen) + lazy_init wrap); pass multimodal_input whole
to NNXDecoder instead of unpacking 5 fields.
- gradient_accumulation.py: ZeRO-1+GA all-reduce annotation deferred until
after lax.scan (reduced/unreduced PartitionSpec is rejected inside scan
carry); use nnx.merge(..., copy=True) to avoid Variable reuse.
- diloco.py: NNX-aware state handling: state.params ->
state.model.filter(nnx.Param), step counter at state.optimizer.step,
replace_nnx_model_params helper for jax.lax.cond pytree-structure parity.
- train_compile.py: new _collect_nnx_activation_shardings helper (forward
pass populates _ACTIVATION_SHARDINGS_DUMP — get_abstract_state_nnx only
traces __init__); NNX path now passes 2-arg shaped_train_args (no rng);
diloco path patched to handle the 2-vs-3 length difference.
- muon_utils.py: get_model_mdn default pure_nnx=True; wrap NNX result as
{"params": nnx.to_pure_dict(...)} for parity with Linen tree shape.
- nnx_decoders.py: FP8+NNX scan fix — Linen FP8 ops (fp8_nanoo, fp8_gpu)
retain tracers in Linen scope across re-traces. Skip jax.checkpoint and
use a Python for-loop instead of jax.lax.scan when quantization is FP8.
Makes FP8 quantization usable on the NNX path.
- train.py (pre-train train_step): return
nnx.state(new_state, nnx.Not(nnx.Intermediate)) so sown forward-pass
artifacts (e.g. max_logits for QK-Clip) don't break leaf-count parity with
state_mesh_shardings.
- llama2.py: pass parameter_memory_host_offload to the
pre_self_attention_layer_norm RMSNorm (was missing on this norm only).
- base.yml: add 4 pipeline-related logical_axis_rules:
layers_outside_pipeline, layers_per_stage, num_activations,
circular_repeats. Additive, no-op without use_nnx_pipeline=True.
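The leaf-count parity issue in the train.py bullet above can be shown with a plain-Python stand-in. This sketch assumes a toy representation (nested dicts whose leaves are `(kind, value)` tuples) and hypothetical helper names; it only mimics the effect of filtering out intermediates, it does not call Flax.

```python
def count_leaves(tree):
    """Count non-dict leaves in a nested dict."""
    if isinstance(tree, dict):
        return sum(count_leaves(v) for v in tree.values())
    return 1


def drop_kind(tree, kind):
    """Copy `tree` without leaves tagged `kind`, pruning emptied sub-dicts."""
    out = {}
    for key, sub in tree.items():
        if isinstance(sub, dict):
            kept = drop_kind(sub, kind)
            if kept:
                out[key] = kept
        elif sub[0] != kind:
            out[key] = sub
    return out


# Toy state after a forward pass: one sown statistic sits next to the params.
state = {
    "model": {
        "decoder": {
            "kernel": ("param", 1.0),
            "max_logits": ("intermediate", 9.0),
        },
    },
    "optimizer": {"step": ("param", 3)},
}
# Shardings were derived before any forward pass, so they know only the params.
shardings = {"model": {"decoder": {"kernel": "fsdp"}}, "optimizer": {"step": None}}

filtered = drop_kind(state, "intermediate")
```

Here `count_leaves(state)` is 3 while `count_leaves(shardings)` is 2; after filtering, the two trees have the same number of leaves again.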
NNX feature enablements (clear all 17 "Pure NNX support has not been
implemented yet" NotImplementedError sites by routing Linen-coupled
utilities to the Linen path; their on-disk format is Linen):
- layerwise_quantization.py (2 sites): operates on Linen-format checkpoints
via DeepSeek*ToLinen layers.
- lora_utils.py (1 site): downstream get_lora_abstract_state expects Linen
tree shape; LoRA adapters on disk are Linen.
- standalone_checkpointer.py (2 sites): add_entropy_to_checkpoint accesses
state.opt_state[0]._replace(mu=..., nu=...) — Linen-only.
- generate_param_only_checkpoint.py (3 sites): _possibly_unroll_params and
_save_decode_checkpoint use state.params["params"]["decoder"] — Linen.
- convert_gpt3_ckpt_from_paxml.py (2 sites): keystr_map targets Linen tree
paths (.params['params'], .opt_state.mu['params']).
- maxengine.py (3 sites): inference engine uses state.params and serves
Linen-format inference checkpoints.
- grpo_trainer.py (4 sites): RL trainer is end-to-end Linen-shaped; route
to Linen with a clear log warning since NNX-format checkpoints will fail
at restore time.
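The keystr targets named in the convert_gpt3_ckpt_from_paxml.py bullet are string paths into the train state. A later commit in this PR describes their NNX counterparts as `.params['params']<rest>` -> `.model<rest>.value` and `.opt_state.mu['params']<rest>` -> `.optimizer.opt_state.mu<rest>.value`. The sketch below applies just those two stated rules as prefix rewrites; the function name and the example path are hypothetical, and any other mappings the script uses are not shown.

```python
# (Linen prefix, NNX prefix) pairs, limited to the two rules stated in the PR.
_PREFIX_MAP = (
    (".opt_state.mu['params']", ".optimizer.opt_state.mu"),
    (".params['params']", ".model"),
)


def linen_keystr_to_nnx(keystr):
    """Rewrite a Linen-style keystr path to the NNX-style path."""
    for linen_prefix, nnx_prefix in _PREFIX_MAP:
        if keystr.startswith(linen_prefix):
            rest = keystr[len(linen_prefix):]
            # NNX leaves are addressed through a trailing .value accessor.
            return f"{nnx_prefix}{rest}.value"
    raise KeyError(f"no NNX mapping sketched for {keystr!r}")
```

For example, `linen_keystr_to_nnx(".params['params']['decoder']['kernel']")` yields `".model['decoder']['kernel'].value"`.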
Vocab tiling on NNX (real implementation, not just routing):
- models.py: add Transformer.logits_from_hidden_states on the NNX
Transformer class — wraps NNXDecoder.apply_output_head with the
token_embedder; mirrors TransformerLinenPure.logits_from_hidden_states.
- vocabulary_tiling.py: add vocab_tiling_nnx_loss — chunks the vocab axis
via jax.lax.scan and calls model.logits_from_hidden_states(chunk) per
chunk. The NNX model carries its parameters internally so no explicit
FSDP gather is needed (unlike the Linen gathered_params pattern). MVP
uses default autograd; custom_vjp memory-savings optimization is a
follow-up if backward memory becomes a concern.
- train.py (NNX loss_fn): replace the NotImplementedError with the call
to vocab_tiling_nnx_loss using hidden_states from intermediates.
- pyconfig_deprecated.py / configs/types.py: drop the num_vocab_tiling > 1
and enable_nnx validation guards (no longer needed).
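The tiling idea can be checked numerically without JAX. The sketch below is a plain-Python stand-in with hypothetical helper names: it assumes each chunk is a slice of the hidden states, projects that slice to logits, and accumulates the cross-entropy, so only one chunk of logits exists at a time. The real implementation uses jax.lax.scan and the model's own output head.

```python
import math


def logits_from_hidden_states(hidden, embedding):
    """Project hidden vectors onto the vocabulary: logits[t][v] = h_t . e_v."""
    return [[sum(h * e for h, e in zip(h_t, e_v)) for e_v in embedding] for h_t in hidden]


def cross_entropy_sum(logits, targets):
    """Summed negative log-likelihood, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total


def tiled_loss(hidden, targets, embedding, num_tiles):
    """Same loss, computed one chunk of hidden states at a time."""
    if len(hidden) % num_tiles:
        raise ValueError("number of positions must divide evenly into tiles")
    tile = len(hidden) // num_tiles
    total = 0.0
    for start in range(0, len(hidden), tile):
        chunk_logits = logits_from_hidden_states(hidden[start:start + tile], embedding)
        total += cross_entropy_sum(chunk_logits, targets[start:start + tile])
    return total
```

With `num_tiles=1` this is the untiled loss; larger values trade peak logits memory for more, smaller projections while leaving the result unchanged up to floating-point rounding.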
DPO + NNX retained as NotImplementedError but with a much more informative
message (points users at pure_nnx=False workaround). Full implementation
is deferred — needs a new TrainState shape carrying both policy and
reference NNX models plus an NNX dpo_loss_fn.
Stats: 26 source files modified, +406 / -171 lines. Linen invariant
verified: pure_nnx / enable_nnx / pure_nnx_decoder still default to False;
Linen-path UTs unaffected (3 pre-existing failures on the parent branch
remain unchanged: sharding_compare_test::deepseek2-16b,
optimizers_test::test_model_integration_kimi-k2-1t, diloco_test::two_slices
x2). All "Pure NNX support has not been implemented yet"
NotImplementedError sites cleared (was 17, now 0).
Implements NNX-native DPO so that the pure_nnx=True training path no longer raises NotImplementedError on use_dpo runs. The Linen DPO overlay pattern (model.apply(params=..., reference_params=...)) does not translate to NNX modules, which carry their parameters internally. Instead the policy and reference models are held as separate nnx.Module instances on TrainStateNNX, and a new dpo_loss_fn_nnx runs both forwards with stop_gradient on the reference logits.
TrainStateNNX:
- Add optional `reference_model: nnx.Module` field. apply_gradients continues to update only `self.model`, leaving `self.reference_model` bit-identical across steps.
dpo_utils.py:
- Add dpo_loss_fn_nnx(policy_model, config, data, dropout_rng, params, reference_model, is_train=True). Signature mirrors the Linen dpo_loss_fn so it slots into gradient_accumulation_loss_and_grad's dispatcher (dropout_rng / params slots are unused for NNX; carried for parity, and reference_model is passed as the single extra_dpo_args entry). With nnx.value_and_grad(..., argnums=0) over the policy, no gradient flows to the reference model's nnx.Param leaves; the explicit jax.lax.stop_gradient on ref_logits is a belt-and-braces guard.
- Both dpo_loss_fn (Linen) and dpo_loss_fn_nnx (NNX) now include indexer_loss=0.0 and mtp_loss=0.0 in aux so the gradient_accumulation aux pytree shape matches the non-DPO loss_fn.
train.py:
- Drop the NotImplementedError in train_step's NNX branch. When use_dpo, dispatch to dpo_loss_fn_nnx with state.reference_model as extra_dpo_args; otherwise use the regular loss_fn. eval_step gains the same dispatch.
- diff_wrapper picks _loss_fn / extra_dpo_args from the per-path init block, so both the GA and non-GA NNX paths route DPO identically.
- Checkpoint-save _split_dpo_state stripping is now Linen-only; TrainStateNNX saves whole (reference_model included), and the step-0 reload later overwrites reference_model from the step-0 checkpoint.
train_utils.py:
- NNX init_state_fn materializes a frozen reference_model alongside the policy when config.use_dpo. Both are constructed by _create_model_partial() with config.init_weights_seed, so they start identical (standard DPO practice) until the step-0 reload.
- Step-0 checkpoint reload: copy step0_state["model"] into state["reference_model"]. Linen path unchanged.
Tests:
- New tests/unit/dpo_nnx_test.py (7 tests): TrainStateNNX reference_model init/hasattr semantics; apply_gradients leaves reference bit-identical; aux key set; identical policy/reference yields loss=log(2) and reward_accuracy=0.0 (strict > on equal logratios); dropout_rng/params slots are signature-compat only; nnx.value_and_grad(argnums=0) over the policy yields finite grads on policy params only.
- train_nnx_test.py: drop the two stale negative tests (vocab_tiling_raises_not_implemented, train_step_dpo_raises_for_nnx); both features are now real.
Stats: 4 source files + 2 test files, +199/-22 source lines. Linen DPO path behaviorally unchanged (only adds two harmless aux-dict keys); NNX non-DPO path unchanged (all changes gated on config.use_dpo).
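The log(2) / 0.0 expectation in the test list follows directly from the standard DPO objective. The sketch below is a scalar, stdlib-only version of that objective, not dpo_loss_fn_nnx itself: the function name, the `beta` parameter, and the use of per-sequence log-probabilities as inputs are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss and reward accuracy over a batch of sequence log-probs."""
    losses, wins = [], []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        chosen_logratio = pc - rc
        rejected_logratio = pr - rr
        margin = beta * (chosen_logratio - rejected_logratio)
        # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
        losses.append(math.log1p(math.exp(-margin)))
        # Strict comparison: ties count as a miss.
        wins.append(1.0 if chosen_logratio > rejected_logratio else 0.0)
    count = len(losses)
    return sum(losses) / count, sum(wins) / count
```

When the policy and reference produce the same log-probabilities, both logratios are zero, the margin is zero, the loss is `log(1 + 1) = log(2)`, and the strict `>` gives a reward accuracy of 0.0.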
…e.py)
PR5 audited maxengine.py and routed the inference path to the Linen
implementation regardless of pure_nnx, with a comment block explaining
that "the flag affects training, not inference serving." That kept the
Linen serving path unchanged but meant pure_nnx=True users silently got
the Linen engine. This change replaces the route with a real NNX flow:
when config.pure_nnx=True, the engine builds an NNX Transformer, splits
out (params, cache, rest) with nnx.split, and at every JIT body merges
the model concretely with nnx.merge to run the forward pass. Linen is
preserved byte-for-byte; every NNX edit is gated `if config.pure_nnx:`
and pure_nnx=False is still the default.
maxengine.py (__init__):
- Build two abstract NNX Transformers on the NNX path: self.model with
model_mode=PREFILL (batch=1, single padded prompt) and self.model_ar
with model_mode=AUTOREGRESSIVE (batch=micro_batch_size_to_train_on,
decode_state shape). Both are needed because NNX cache vars inherit
CACHE_BATCH_PREFILL vs CACHE_BATCH from the construction model_mode,
and bulk_insert searches for the substring "cache_batch" in the
AR-mode logical-axes tuple. nnx.eval_shape is called directly inside
nn_partitioning.axis_rules rather than through create_nnx_abstract_model
to avoid the jax.set_mesh wrap that trips Flax 0.12.6 on logical-only
axes like "norm" (same reason get_abstract_state_nnx avoids set_mesh).
- Cache the graphdef from a 3-way nnx.split(Param, Cache, ...) so JIT
bodies can pass (params, cache, rest) separately to nnx.merge. The
rest slot (RNG vars etc.) is materialized concretely in load_params.
maxengine.py (cache adapter + _nnx_run_model):
- bulk_insert / _insert_jit / _maybe_*_prefill_result_cache walk the
cache via tree_map_with_path and switch on path[-1].key (the cache
variable name like "cached_prefill_key"). Linen mutable cache is a
plain nested dict. NNX Cache state would expose a ".value" accessor
at that position. Bridge via nnx.State.to_pure_dict() (after the
model run) and nnx.replace_by_pure_dict (before nnx.merge), so the
cache plumbing helpers see the same shape on both paths.
- Add _nnx_run_model: nnx.merge(graphdef, params, cache, rest, copy=True)
-> model(...) -> nnx.state(model, nnx.Cache).to_pure_dict(). copy=True
avoids reusing Variable objects across traces (TraceContextError),
mirroring train.py's diff_wrapper workaround.
- Add _nnx_cache_state_template / _nnx_init_cache_dict helpers
parametrised by mode so prefill (batch 1) and decode_state (batch N)
pull from the right abstract model.
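The reason for bridging through a pure dict can be seen with a small path-aware tree walk. This is a plain-Python stand-in under stated assumptions: the `.value` accessor is modeled as a `"value"` dict key, `map_with_path`, `strip_value_wrappers`, and `leaf_names` are hypothetical helpers, and `cached_ar_key` is an illustrative name next to the `cached_prefill_key` mentioned above.

```python
def map_with_path(fn, tree, path=()):
    """Apply fn(path, leaf) to every non-dict leaf, keeping the tree shape."""
    if isinstance(tree, dict):
        return {k: map_with_path(fn, v, path + (k,)) for k, v in tree.items()}
    return fn(path, tree)


def strip_value_wrappers(tree):
    """Turn {"name": {"value": leaf}} into {"name": leaf}."""
    if isinstance(tree, dict):
        if set(tree) == {"value"}:
            return tree["value"]
        return {k: strip_value_wrappers(v) for k, v in tree.items()}
    return tree


def leaf_names(tree):
    """Collect path[-1] for each leaf, i.e. what a walker would switch on."""
    names = []
    map_with_path(lambda path, leaf: names.append(path[-1]), tree)
    return sorted(names)


nnx_like_cache = {
    "decoder": {
        "layers": {
            "cached_prefill_key": {"value": [0.0]},
            "cached_ar_key": {"value": [1.0]},
        }
    }
}
```

On the wrapped tree every leaf reports `"value"` as its last path element, so a switch on the cache-variable name never matches; after `strip_value_wrappers` the walker sees the variable names themselves, as it does on the Linen path.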
maxengine.py (load_params):
- New _load_params_nnx: accepts user-provided NNX-shape params or loads
via from_pretrained. For user-provided params, materializes a concrete
model once via _create_model_fn() to capture a real rest state for
nnx.merge (wasteful but simple; the from_pretrained branch avoids
this). Refreshes self.graphdef from the concrete model so subsequent
merges line up exactly.
- Builds self.abstract_params, populates self.prefill_kv_cache_annotations
and self.kv_cache_annotations (using model_ar for the latter so
bulk_insert's substring lookup hits), wraps both into NamedSharding.
- pure_nnx + quantization, pure_nnx + LoRA, pure_nnx +
stack_prefill_result_cache=True, pure_nnx + prefill_multisampling,
and pure_nnx + prefill_concat raise NotImplementedError for now;
the Linen path is the workaround. AOT compilation
(aot_compile / _compile_generate_and_get_layouts) is not gated and
may work as-is; not exercised by tests yet.
maxengine.py (init_decode_state, _prefill_jit, _generate_jit):
- _init_decode_state_nnx zero-initializes a pure-dict cache from
model_ar (so the leading batch dim matches generate's input shape)
and builds kv_cache_annotations_named per leaf by reading
nnx.Cache.metadata. Tries "out_sharding", "sharding", and
"sharding_names" because Flax 0.12.6 renamed these.
- _prefill_jit / _generate_jit add an `if config.pure_nnx:` branch
that calls _nnx_run_model in place of self.model.apply with
mutable=["cache"]. existing_prefix.cache is threaded as a pure-dict
cache directly (no params|{"cache":...} dict-merge — params is an
nnx.State, not a dict).
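The metadata lookup described in the first bullet amounts to a first-match search over candidate keys. A minimal sketch, assuming the metadata is available as a plain dict (the real object is Flax variable metadata) and using a hypothetical helper name:

```python
# Candidate keys in the order the commit message lists them.
_SHARDING_KEYS = ("out_sharding", "sharding", "sharding_names")


def read_logical_axes(metadata):
    """Return the logical axes stored under the first key that is present."""
    for key in _SHARDING_KEYS:
        axes = metadata.get(key)
        if axes is not None:
            return tuple(axes)
    # No annotation at all: the caller decides how to treat the leaf.
    return None
```

Trying several keys keeps the lookup working whichever name the installed Flax version uses, at the cost of silently preferring the earlier key if more than one is present.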
maxtext_utils.py:
- New get_prefill_kv_cache_annotations_nnx / get_kv_cache_annotations_nnx
that mirror the Linen helpers' return shape (per-leaf PartitionSpec
tree). Both delegate to _nnx_cache_partition_specs which extracts
nnx.Cache state via nnx.split, calls
get_nnx_named_sharding_with_scan_axis inside
nn_partitioning.axis_rules so logical axes ("layers", "cache_batch",
"norm", ...) resolve to physical mesh axes, and converts the result
to a pure-dict tree.
tests/unit/maxengine_test.py:
- New tests: test_init_nnx, test_basic_prefill_nnx (with NaN/inf and
per-layer cache shape checks), test_basic_decode_nnx (4-step generate
with next_pos advancement check), test_quantize_raises_for_nnx,
test_lora_raises_for_nnx.
- New test_linen_nnx_parity_prefill: bridges Linen-init params into
the NNX engine via linen_nnx_converter (convert_linen_to_nnx ->
_strip_value_wrappers -> nnx.replace_by_pure_dict) and asserts the
NNX engine's prefill matches Linen on the same weights — logits
within bf16 tolerance (rtol=0.05, atol=0.1; the test config uses
bf16 compute) and exact greedy first-token argmax.
- Existing Linen tests untouched.
Test summary: 9 passed, 1 skipped (test_chunked_prefill is a
pre-existing CPU-only skip). bash lint.sh: codespell + pylint + pyink
all green.
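The parity criterion in test_linen_nnx_parity_prefill combines a tolerance check with an exact argmax match. The sketch below reproduces that shape in plain Python; it assumes the NumPy-style rule `|actual - expected| <= atol + rtol * |expected|`, and the helper names are hypothetical rather than taken from the test file.

```python
def allclose(actual, expected, rtol, atol):
    """Element-wise tolerance check in the NumPy allclose convention."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


def argmax(values):
    """Index of the largest element (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def prefill_parity(linen_logits, nnx_logits, rtol=0.05, atol=0.1):
    """True when logits agree within tolerance and greedy tokens match exactly."""
    return allclose(nnx_logits, linen_logits, rtol, atol) and argmax(nnx_logits) == argmax(linen_logits)
```

The argmax condition matters because two logit vectors can agree within a loose bf16 tolerance and still pick different greedy tokens when the top two logits are close.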
Closes the QK-Clip TODO and migrates the remaining Linen-only checkpoint utilities to NNX. Linen paths preserved byte-for-byte (every NNX edit is gated on `config.pure_nnx` or runtime state-shape detection).
QK-Clip:
- qk_clip_utils.apply_qk_clip_nnx mutates state.model in place via nnx.split -> pure-dict tree_map -> nnx.replace_by_pure_dict -> nnx.update. Accepts both the production NNX intermediate shape (self_attention.attention_op.max_logits) and the synthetic-fixture shape from the existing Linen tests (self_attention.max_logits).
- train.py train_step dispatches to apply_qk_clip_nnx for NNX, removing the prior TODO at the QK-Clip call site.
Checkpoint utilities (NNX paths added):
- standalone_checkpointer.checkpoint_loop builds an NNX init_state_fn under pure_nnx; add_entropy_to_checkpoint dispatches across Linen TrainState, NNX TrainStateNNX Module, and post-split nnx.State shapes.
- generate_param_only_checkpoint: NNX init_state_fn under pure_nnx; _possibly_unroll_params_nnx slices scanned NNX layers via dict-style state mutation; _save_decode_checkpoint_nnx writes a bf16 pure-dict tree to orbax. Parallel LoRA decode flow operates on the single-nested LoRA delta tree from PR8's get_lora_abstract_state_nnx.
- convert_gpt3_ckpt_from_paxml: parallel NNX state_map keystr translation (.params['params']<rest> -> .model<rest>.value, etc.). End-to-end paxml -> NNX conversion is wired but not yet validated on hardware.
Tests:
- qk_clip_test: 7 new NNX cases covering attention-type guard, MLA wq_b/wkv_b math, both intermediate shapes, no-clip-below-threshold, missing-stats resilience, Linen<->NNX numeric parity.
- standalone_checkpointer_nnx_test (new): 3 cases for adam mu/nu overwrite on TrainStateNNX Module shape, no mutation of state.model params, post-split nnx.State shape from setup_training_state.
- generate_param_only_checkpoint_nnx_test (new): 3 cases for scanned layer slicing (Llama-style; DeepSeek-style dense+moe split; LoRA delta unroll on the single-nested NNX shape).
NNX + AQT in MaxEngine and the layerwise_quantization NNX path are split into the follow-up PR9.5.
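For readers unfamiliar with QK-Clip, the core scaling rule can be written in a few lines. This sketch follows the commonly published description of QK-Clip (a factor of `min(1, threshold / max_logit)`), not the code in qk_clip_utils.py: the function names are hypothetical, the weights are flat lists of floats, the factor is split evenly as a square root between query and key, and the MLA-specific handling of wq_b/wkv_b is not modeled.

```python
import math


def qk_clip_scale(max_logit, threshold):
    """Return gamma = min(1, threshold / max_logit) for a positive threshold."""
    if max_logit <= threshold:
        # Below the threshold nothing is clipped.
        return 1.0
    return threshold / max_logit


def clip_query_key(wq, wk, max_logit, threshold):
    """Scale query and key weights so their product shrinks by gamma."""
    gamma = qk_clip_scale(max_logit, threshold)
    root = math.sqrt(gamma)
    return [w * root for w in wq], [w * root for w in wk]
```

Because attention logits are bilinear in the query and key projections, scaling each side by `sqrt(gamma)` scales the logits by `gamma`, which brings an observed maximum of `max_logit` back down to `threshold`.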
NNX Migration Route Map
- `pure_nnx` flag, `init_state_fn`, `TrainStateNNX`, NNX utils. Linen workflow unchanged. (PR NNX migration prep (1/N): pure_nnx flag and init_state_fn scaffolding #3427)
- `get_abstract_state_nnx`, `get_named_sharding_nnx`, `set_named_sharding_nnx`, `get_partition_spec_nnx`, `get_mesh_from_config`. (PR NNX migration prep (2/N): NNX utils and sharding utilities #3470)
- `TrainStateNNX`, model creation, gradient accumulation, checkpointing, and training loop dispatch. (PR NNX migration prep (3/N): TrainState, model creation, and end-to-end training loop #3500)
- 4.5. ✅ Linen↔NNX checkpoint converter. (PR [NNX] NNX migration prep (4.5/N): Linen<->NNX checkpoint converter #3843)
- 4.6. ❌ Linen↔NNX checkpoint comparator (sibling branch on PR4.5).
- `apply_qk_clip_nnx` mutates `state.model` in place (resolves the `train.py:517` TODO); NNX paths added to `standalone_checkpointer`, `generate_param_only_checkpoint`, `convert_gpt3_ckpt_from_paxml`. Originally bundled with NNX-AQT and a gpt3 prefill fix; on 2026-05-07 those were split into PR9.5. Stacks on PR8.
- 9.5. ❌ NNX + AQT in MaxEngine + serve-mode reload + gpt3 prefill fix (stacked follow-up).
- `custom_vjp` for NNX.
- `True`; regenerate sharding goldens; flip back integration-test `pure_nnx=False` annotations.
Description
This PR closes the QK-Clip TODO and migrates three Linen-only checkpoint utilities (`standalone_checkpointer`, `generate_param_only_checkpoint`, `convert_gpt3_ckpt_from_paxml`) to NNX. NNX-shape walkers sit alongside the existing Linen ones, dispatching on `config.pure_nnx` (or runtime state-shape detection); every Linen path is preserved byte-for-byte.
The original PR9 also bundled NNX-AQT + a gpt3 prefill fix; on 2026-05-07 those were split into a stacked follow-up to keep each PR narrowly reviewable. This PR's diff: +907 / −161 across 8 files (5 source + 3 tests, of which 2 are new).
Part 1: NNX-aware QK-Clip
`src/maxtext/utils/qk_clip_utils.py` factors shared math out of the existing Linen helper and adds an NNX sibling:
- `_max_logits_at`, `_scale_from_max_logits`, `_clip_mla_weight`: shared across Linen and NNX paths.
- `apply_qk_clip_nnx(state, intermediate_outputs, config)` mutates `state.model` in place via `nnx.split` → pure-dict `tree_map` → `nnx.replace_by_pure_dict` → `nnx.update`. Accepts both the production NNX intermediate shape (`self_attention.attention_op.max_logits`, sown inside `AttentionOp`) and the synthetic-fixture shape used by the existing Linen tests (`self_attention.max_logits`).
- `calculate_max_logit_metric` recognizes both the Linen `(array,)`-tuple shape and the bare-array NNX shape.
`src/maxtext/trainers/pre_train/train.py::train_step` now branches on `isinstance(model, nn.Module)` to call `apply_qk_clip` for Linen and `apply_qk_clip_nnx` for NNX. The TODO at the QK-Clip call site is removed.
Part 2: NNX-format Checkpoint Utilities
Each utility gets an explicit NNX path; every routing-to-Linen comment is gone.
`src/maxtext/utils/standalone_checkpointer.py`:
- `checkpoint_loop` builds an NNX `init_state_fn` under `pure_nnx` (mirroring PR8's GRPO trainer).
- `add_entropy_to_checkpoint` dispatches across three input shapes (Linen `TrainState`, NNX `TrainStateNNX` Module, post-split `nnx.State`). All three produce identical `cos(1000*p)` / `sin(1000*p)` mu/nu replacements.
`src/maxtext/utils/generate_param_only_checkpoint.py`:
- `_read_train_checkpoint` builds an NNX `init_state_fn` under `pure_nnx`.
- New `_possibly_unroll_params_nnx` slices scanned NNX layers via dict-style mutation on `state.model.decoder` (uses `pop` / `__setitem__` since `nnx.State` is dict-like).
- New `_save_decode_checkpoint_nnx` writes a bf16 pure-dict tree to orbax (matches the NNX-detection branch in `from_pretrained`).
- Parallel LoRA decode flow (`_generate_lora_decode_checkpoints_nnx` + `_possibly_unroll_lora_params_nnx` + `_save_lora_decode_checkpoint_nnx`) operates on the single-nested LoRA delta tree from PR8's `get_lora_abstract_state_nnx` (`{"decoder": {...}}`, no outer `params` wrap).
`src/maxtext/checkpoint_conversion/standalone_scripts/convert_gpt3_ckpt_from_paxml.py`:
- Parallel NNX `state_map` keystr translation (`.params['params']<rest>` → `.model<rest>.value`, `.opt_state.mu['params']<rest>` → `.optimizer.opt_state.mu<rest>.value`, etc.).
- Save uses `state.optimizer.step.value` for the step number on NNX.
- End-to-end paxml → NNX conversion is wired but not yet validated on hardware (needs a paxml gpt3 checkpoint plus TPU/GPU); only the import / shape-walk parts are exercised in this PR.
Part 3: Carve-outs
Tests
New unit tests (`tests/unit/qk_clip_test.py`, 7 new NNX cases on top of the existing Linen suite):
- `QKClipNNXTest`: attention-type guard, MLA `wq_b` / `wkv_b` math, both intermediate shapes, no-clip-below-threshold, missing-stats resilience, Linen↔NNX numeric parity on identical weights.
- `CalculateMaxLogitNNXTest`: bare-array intermediate shape recognition.
New unit tests (`tests/unit/standalone_checkpointer_nnx_test.py`, 3 tests): adam mu/nu overwrite on `TrainStateNNX` Module shape, no mutation of `state.model` params, post-split `nnx.State` shape from `setup_training_state`.
New unit tests (`tests/unit/generate_param_only_checkpoint_nnx_test.py`, 3 tests): Llama-style scanned-layer slicing (single `layers` group), DeepSeek-style scanned-layer slicing (`dense_layers` + `moe_layers` split), LoRA delta unroll on the single-nested NNX-derived shape.
Existing Linen tests: untouched and still pass; `pure_nnx=False` stays default.
Test results: 23 passed, 2 skipped across the PR9 surface (`qk_clip_test`, `standalone_checkpointer_nnx_test`, `generate_param_only_checkpoint_nnx_test`).
Linting: `bash lint.sh` reports pyink + pylint clean.
Stats
`config.pure_nnx` or runtime state-shape detection.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.