14 changes: 9 additions & 5 deletions docs/guides/checkpointing_solutions/convert_checkpoint.md
@@ -1,3 +1,5 @@
(checkpoint-conversion)=

# Checkpoint Conversion Utilities

This guide provides instructions to use [checkpoint conversion scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion) to convert model checkpoints bidirectionally between Hugging Face and MaxText formats.
@@ -23,10 +25,12 @@ The following models are supported:

## Prerequisites

-- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html).
+- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](install-from-source).
- Hugging Face model checkpoints are cached locally at `$HOME/.cache/huggingface/hub` before conversion. Ensure you have sufficient disk space.
- Authenticate via the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/v0.21.2/guides/cli) if using private or gated models.

(hf-to-maxtext)=

## Hugging Face to MaxText

Use the `to_maxtext.py` script to convert a Hugging Face model checkpoint into a MaxText checkpoint. The script will automatically download the specified model from the Hugging Face Hub, perform conversion, and save converted checkpoints to the given output directory.
@@ -71,7 +75,7 @@ You can find your converted checkpoint files under `${BASE_OUTPUT_DIRECTORY}/0/i
### Key Parameters

- `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information.
- `use_multimodal`: Indicates whether multimodality is used; this is important for Gemma3.
- `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be Google Cloud Storage (GCS) or local.
- `hardware=cpu`: The conversion script runs on a CPU machine.
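
For reference, a full invocation assembled from the parameters above might look like the following sketch. The model name and output bucket are placeholders, and the flag spelling is an assumption patterned on the `to_huggingface` command shown in this guide; consult the script itself for the authoritative flag set.

```shell
# Hypothetical example: convert Qwen3-4B into a scanned MaxText checkpoint.
# model_name and base_output_directory are placeholder values.
python3 -m maxtext.checkpoint_conversion.to_maxtext \
    model_name=qwen3-4b \
    scan_layers=true \
    use_multimodal=false \
    hardware=cpu \
    base_output_directory=gs://your-bucket/maxtext-checkpoints
```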
@@ -118,7 +122,7 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface \

- `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
- `load_parameters_path`: The path to the MaxText Orbax checkpoint.
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information.
- `use_multimodal`: Indicates whether multimodality is used; this is important for Gemma3.
- `hardware=cpu`: The conversion script runs on a CPU machine.
- `base_output_directory`: The path where the converted checkpoint will be stored; it can be Google Cloud Storage (GCS), Hugging Face Hub or local.
@@ -128,7 +132,7 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface \

To verify that the conversion succeeded, use the [test script](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/utils/forward_pass_logit_checker.py). It runs a forward pass on both the original and converted models and compares the output logits, so it can validate the conversion in either direction.

-> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html#from-source).
+> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](install-from-source).

### Setup Environment

Expand Down Expand Up @@ -159,7 +163,7 @@ python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml \

- `load_parameters_path`: The path to the MaxText Orbax checkpoint (e.g., `gs://your-bucket/maxtext-checkpoint/0/items`).
- `model_name`: The corresponding model name in the MaxText configuration (e.g., `qwen3-4b`).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information.
- `use_multimodal`: Indicates if multimodality is used.
- `--run_hf_model` (Optional): Indicates whether to load the Hugging Face model from `hf_model_path`. If not set, the script compares the MaxText logits with pre-saved golden logits.
- `--hf_model_path` (Optional): The path to the Hugging Face checkpoint (if `--run_hf_model=True`).
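
Conceptually, the checker boils down to running both models on identical inputs and bounding the disagreement between their logits. The following is a simplified, self-contained illustration of that idea, not the actual `forward_pass_logit_checker` implementation; the tolerance and example values are arbitrary.

```python
def max_abs_diff(logits_a, logits_b):
    """Largest element-wise absolute difference between two logit vectors."""
    return max(abs(a - b) for a, b in zip(logits_a, logits_b))

def same_top_token(logits_a, logits_b):
    """Whether both models rank the same token highest."""
    return max(range(len(logits_a)), key=logits_a.__getitem__) == \
           max(range(len(logits_b)), key=logits_b.__getitem__)

# Logits for one position from the "original" and "converted" models.
original = [2.10, -0.52, 3.47, 0.08]
converted = [2.11, -0.50, 3.46, 0.09]

assert max_abs_diff(original, converted) < 0.1  # numerically close
assert same_top_token(original, converted)      # same argmax token
```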
2 changes: 2 additions & 0 deletions docs/guides/data_input_pipeline.md
@@ -26,6 +26,8 @@ Currently MaxText has three data input pipelines:
| **[Hugging Face](data_input_pipeline/data_input_hf.md)** | datasets in [Hugging Face Hub](https://huggingface.co/datasets)<br>local/Cloud Storage datasets in json, parquet, arrow, csv, txt (sequential access) | no download needed, convenience; <br>multiple formats | limited scalability when using the Hugging Face Hub (no limit using Cloud Storage); <br>non-deterministic with preemption<br>(deterministic without preemption)<br> |
| **[TFDS](data_input_pipeline/data_input_tfds.md)** | TFRecord (sequential access), available through [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/overview) | performant | only supports TFRecords; <br>non-deterministic with preemption<br>(deterministic without preemption) |

(multihost-dataloading-best-practice)=

## Multihost dataloading best practice

Training in a multi-host environment presents unique challenges for data input pipelines. An effective data loading strategy must address three key issues:
4 changes: 4 additions & 0 deletions docs/guides/data_input_pipeline/data_input_grain.md
@@ -1,3 +1,5 @@
(grain-pipeline)=

# Grain pipeline

## The recommended input pipeline for determinism and resilience!
@@ -30,6 +32,8 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
- **Global shuffle**: This feature is only available when using Grain with [ArrayRecord](https://github.com/google/array_record) (random access) format, achieved by shuffling indices globally at the beginning of each epoch and then reading the elements according to the random order. This shuffle method effectively prevents local overfitting, leading to better training results.
- **Hierarchical shuffle**: For sequential access format [Parquet](https://arrow.apache.org/docs/python/parquet.html), shuffle is performed by these steps: file shuffling, interleave from files, and window shuffle using a fixed size buffer.
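
The two strategies can be illustrated with plain Python. This sketches the concept only — index permutation for global shuffle, a fixed-size buffer for window shuffle — and is not Grain's actual implementation (the drain step at the end is deliberately simplified):

```python
import random

records = list(range(10))  # stand-in for dataset element indices

def global_shuffle(indices, seed):
    """Global shuffle (random access): permute every index up front,
    then read records in that order. Deterministic for a given seed."""
    rng = random.Random(seed)
    order = indices[:]
    rng.shuffle(order)
    return order

def window_shuffle(stream, buffer_size, seed):
    """Window shuffle (sequential access): keep a fixed-size buffer and
    emit a random buffered element as each new record streams in."""
    rng = random.Random(seed)
    buffer, out = [], []
    for item in stream:
        buffer.append(item)
        if len(buffer) > buffer_size:
            out.append(buffer.pop(rng.randrange(len(buffer))))
    out.extend(buffer)  # drain the remaining items (simplified)
    return out

print(global_shuffle(records, seed=0))
print(window_shuffle(records, buffer_size=4, seed=0))
```

Note how the window variant can only displace an element by roughly the buffer size, which is why global shuffle gives stronger mixing when the format permits random access.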

(using-grain)=

## Using Grain

1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
12 changes: 6 additions & 6 deletions docs/guides/model_bringup.md
@@ -20,15 +20,15 @@ This documentation acts as the primary resource for efficiently integrating new

## 1. Architecture Analysis

-The first phase involves determining how the new model's architecture aligns with MaxText's existing capabilities. To facilitate this assessment, refer to the [MaxText architecture overview](https://maxtext.readthedocs.io/en/latest/reference/architecture/architecture_overview.html) and [list of supported models](https://maxtext.readthedocs.io/en/latest/reference/models/supported_models_and_architectures.html).
+The first phase involves determining how the new model's architecture aligns with MaxText's existing capabilities. To facilitate this assessment, refer to the [MaxText architecture overview](architecture-overview) and [list of supported models](supported-models).

-**Input Data Pipeline**: MaxText supports HuggingFace, Grain, and TFDS pipelines ([details](https://maxtext.readthedocs.io/en/latest/guides/data_input_pipeline.html)). While synthetic data is typically used for initial performance benchmarks, the framework supports multiple modalities including text and image (audio and video - work in progress).
+**Input Data Pipeline**: MaxText supports HuggingFace, Grain, and TFDS pipelines ([details](data-input-pipeline)). While synthetic data is typically used for initial performance benchmarks, the framework supports multiple modalities including text and image (audio and video - work in progress).

**Tokenizer**: Supported [tokenizer options](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/input_pipeline/tokenizer.py) include `TikTokenTokenizer`, `SentencePieceTokenizer`, and `HFTokenizer`.

**Self-Attention & RoPE**: Available mechanisms include optimized [Flash Attention](https://github.com/AI-Hypercomputer/maxtext/blob/62ee818144eb037ad3fe85ab8e789cd074776f46/src/maxtext/layers/attention_op.py#L1184) (supporting MHA, GQA, and MQA), Multi-head Latent Attention ([MLA](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/attention_mla.py)), and [Gated Delta Network](https://github.com/AI-Hypercomputer/maxtext/blob/62ee818144eb037ad3fe85ab8e789cd074776f46/src/maxtext/models/qwen3.py#L358). MaxText also supports [Regular](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L108), [Llama](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L178), and [YaRN](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L282) variations of Rotary Positional Embeddings (RoPE).
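
As a reminder of what the RoPE variants share: rotary embeddings rotate each pair of feature dimensions by a position-dependent angle. A minimal sketch of the basic ("Regular") form, independent of the MaxText implementation, is:

```python
import math

def rope(x, position, base=10000.0):
    """Apply rotary position embedding to one head vector (even length)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s,  # 2-D rotation of the pair
                    x[i] * s + x[i + 1] * c])
    return out

q = [1.0, 0.0, 0.5, -0.5]
rotated = rope(q, position=3)
# A rotation preserves the vector's norm; only relative angles change.
assert abs(sum(v * v for v in rotated) - sum(v * v for v in q)) < 1e-9
```

The Llama and YaRN variants linked above change how the per-pair frequencies are computed and scaled, not this underlying rotation.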

-**Multi-Layer Perceptron (MLP)**: The framework supports both traditional dense models and Mixture of Experts (MoE) architectures, including [configurations](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/moe_configuration.html) for routed and shared experts.
+**Multi-Layer Perceptron (MLP)**: The framework supports both traditional dense models and Mixture of Experts (MoE) architectures, including [configurations](moe-configuration) for routed and shared experts.

**Normalization**: We support different [normalization strategies](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/normalizations.py), including RMSNorm and Gated RMSNorm. These can be configured before or after attention/MLP layers.

@@ -44,7 +44,7 @@ This step can be bypassed if the current MaxText codebase already supports all c

While most open-source models are distributed in Safetensors or PyTorch formats, MaxText requires conversion to the [Orbax](https://orbax.readthedocs.io/en/latest/) format.

-There are [two primary formats](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/checkpoints.html) for Orbax checkpoints within MaxText, and while both are technically compatible with training and inference, we recommend following these performance-optimized guidelines:
+There are [two primary formats](checkpoints) for Orbax checkpoints within MaxText, and while both are technically compatible with training and inference, we recommend following these performance-optimized guidelines:

- **Scanned Format**: Recommended for **training** as it stacks layers for efficient processing via `jax.lax.scan`. To enable this, set `scan_layers=True`.
- **Unscanned Format**: Recommended for **inference** to simplify loading individual layer parameters. To enable this, set `scan_layers=False`.
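
The structural difference between the two formats can be sketched with plain Python containers. The key names and shapes here are illustrative only, not the exact MaxText parameter tree:

```python
NUM_LAYERS, D = 4, 8

# Unscanned: one parameter subtree per layer, addressed by name.
unscanned = {f"layers_{i}": {"w": [[0.0] * D for _ in range(D)]}
             for i in range(NUM_LAYERS)}

# Scanned: the same weights stacked along a leading layer axis, so a
# single [NUM_LAYERS, D, D] subtree can be iterated by jax.lax.scan.
scanned = {"layers": {"w": [unscanned[f"layers_{i}"]["w"]
                            for i in range(NUM_LAYERS)]}}

assert len(scanned["layers"]["w"]) == NUM_LAYERS  # leading layer axis
assert len(unscanned) == NUM_LAYERS               # one entry per layer
```

Stacking buys compile-time and memory wins during training, while per-layer entries make it cheap to load individual layers at inference time — which is why the guidelines above differ by use case.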
@@ -58,7 +58,7 @@ Success starts with a clear map. You must align the parameter names from your so

### 3.2 Write Script

-Use existing model scripts within the repository as templates to tailor the conversion logic for your specific architecture. We strongly recommended to use the [checkpoint conversion utility](https://maxtext.readthedocs.io/en/latest/guides/checkpointing_solutions/convert_checkpoint.html) rather than [standalone scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion/standalone_scripts).
+Use existing model scripts within the repository as templates to tailor the conversion logic for your specific architecture. We strongly recommend using the [checkpoint conversion utility](checkpoint-conversion) rather than [standalone scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion/standalone_scripts).

### 3.3 Verify Compatibility

Expand Down Expand Up @@ -132,7 +132,7 @@ If you run the `forward_pass_logit_checker.py` to compare reference logits with

**Q: How to compile models for a target hardware without physical access?**

-**A:** If you need to compile your training run ahead of time, use the train_compile.py tool. This utility allows you to compile the primary train_step for specific target hardware without needing the actual devices on hand. It’s particularly useful for verifying your implementation's functionality on a local Cloud VM or a standard CPU. Please refer [here](https://maxtext.readthedocs.io/en/latest/guides/monitoring_and_debugging/features_and_diagnostics.html#ahead-of-time-compilation-aot) for more examples.
+**A:** If you need to compile your training run ahead of time, use the `train_compile.py` tool. This utility allows you to compile the primary `train_step` for specific target hardware without needing the actual devices on hand. It’s particularly useful for verifying your implementation's functionality on a local Cloud VM or a standard CPU. Please refer [here](aot-compilation) for more examples.

**Q: My model is too large for my development machine. What should I do?**

2 changes: 1 addition & 1 deletion docs/guides/optimization/custom_model.md
@@ -254,7 +254,7 @@ Ironwood over ICI:
- `3 * M * 8 / 2 > 12800`
- `M > 1100`
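
The arithmetic behind the final inequality can be checked directly. This only verifies the algebra of the quoted roofline expression, not the hardware constants themselves:

```python
# Solve 3 * M * 8 / 2 > 12800 for M.
threshold = 12800 * 2 / (3 * 8)
print(threshold)  # ≈ 1066.7, which the guide rounds up to M > 1100

# Sanity check: M = 1100 does satisfy the original inequality.
assert 3 * 1100 * 8 / 2 > 12800
```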

-It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/optimization/sharding.html) for specific challenges regarding PP + FSDP/DP.
+It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [sharding guide](sharding_on_TPUs) for specific challenges regarding PP + FSDP/DP.

## Step 4. Analyze experiments

4 changes: 2 additions & 2 deletions docs/guides/run_python_notebook.md
@@ -86,7 +86,7 @@ To install, click the `Extensions` icon on the left sidebar (or press `Ctrl+Shif

### Step 3: Install MaxText and Dependencies

-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](install-from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.

> **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.

@@ -139,7 +139,7 @@ pip3 install jupyterlab

### Step 3: Install MaxText and Dependencies

-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](install-from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.

> **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.

2 changes: 1 addition & 1 deletion docs/install_maxtext.md
@@ -74,7 +74,7 @@ This is the easiest way to get started with the latest stable version.
access to the `build_maxtext_docker_image`, `upload_maxtext_docker_image`,
and `xpk` commands. For more details on building and uploading Docker
images, see the
-    [Build MaxText Docker Image](https://maxtext.readthedocs.io/en/latest/build_maxtext.html)
+    [Build MaxText Docker Image](./build_maxtext)
guide.

2 changes: 2 additions & 0 deletions docs/reference/architecture/architecture_overview.md
@@ -1,3 +1,5 @@
(architecture-overview)=

# Architecture overview

## The MaxText philosophy