diff --git a/docs/guides/checkpointing_solutions/convert_checkpoint.md b/docs/guides/checkpointing_solutions/convert_checkpoint.md
index cfb32e0d5a..3df99a5886 100644
--- a/docs/guides/checkpointing_solutions/convert_checkpoint.md
+++ b/docs/guides/checkpointing_solutions/convert_checkpoint.md
@@ -1,3 +1,5 @@
+(checkpoint-conversion)=
+
# Checkpoint Conversion Utilities
This guide provides instructions to use [checkpoint conversion scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion) to convert model checkpoints bidirectionally between Hugging Face and MaxText formats.
@@ -23,10 +25,12 @@ The following models are supported:
## Prerequisites
-- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html).
+- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](install-from-source).
- Hugging Face model checkpoints are cached locally at `$HOME/.cache/huggingface/hub` before conversion. Ensure you have sufficient disk space.
- Authenticate via the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/v0.21.2/guides/cli) if using private or gated models.
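For gated models, a typical login flow looks like the sketch below; the token itself is created at https://huggingface.co/settings/tokens:

```bash
# Authenticate the Hugging Face CLI so gated/private checkpoints can be downloaded.
huggingface-cli login
# Or non-interactively, assuming a token is already exported as HF_TOKEN:
# huggingface-cli login --token "${HF_TOKEN}"
```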
+(hf-to-maxtext)=
+
## Hugging Face to MaxText
Use the `to_maxtext.py` script to convert a Hugging Face model checkpoint into a MaxText checkpoint. The script will automatically download the specified model from the Hugging Face Hub, perform conversion, and save converted checkpoints to the given output directory.
@@ -71,7 +75,7 @@ You can find your converted checkpoint files under `${BASE_OUTPUT_DIRECTORY}/0/i
### Key Parameters
- `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information.
- `use_multimodal`: Indicates whether multimodal inputs are used; important for Gemma3.
- `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be Google Cloud Storage (GCS) or local.
- `hardware=cpu`: The conversion script runs on a CPU machine.
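Putting these parameters together, a conversion command looks roughly like the following. This is a sketch only: the model name and bucket path are placeholders, so check the script's documentation for the authoritative invocation.

```bash
# Hypothetical example; substitute your own model name and output bucket.
python3 -m maxtext.checkpoint_conversion.to_maxtext \
  src/maxtext/configs/base.yml \
  model_name=qwen3-4b \
  base_output_directory=gs://your-bucket/maxtext-checkpoints \
  scan_layers=true \
  use_multimodal=false \
  hardware=cpu
```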
@@ -118,7 +122,7 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface \
- `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
- `load_parameters_path`: The path to the MaxText Orbax checkpoint.
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information.
- `use_multimodal`: Indicates whether multimodal inputs are used; important for Gemma3.
- `hardware=cpu`: The conversion script runs on a CPU machine.
- `base_output_directory`: The path where the converted checkpoint will be stored; it can be Google Cloud Storage (GCS), Hugging Face Hub or local.
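As a sketch, a reverse conversion might be invoked as follows; flag names mirror the parameters above, but the paths and model identifier are placeholders:

```bash
# Hypothetical example; adjust the checkpoint path and destination.
python3 -m maxtext.checkpoint_conversion.to_huggingface \
  src/maxtext/configs/base.yml \
  model_name=qwen3-4b \
  load_parameters_path=gs://your-bucket/maxtext-checkpoint/0/items \
  scan_layers=false \
  use_multimodal=false \
  hardware=cpu \
  base_output_directory=gs://your-bucket/hf-checkpoints
```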
@@ -128,7 +132,7 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface \
To ensure the conversion was successful, you can use the [test script](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/utils/forward_pass_logit_checker.py). It runs a forward pass on both the original and converted models and compares the output logits, verifying conversion in both directions.
-> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html#from-source).
+> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](install-from-source).
### Setup Environment
@@ -159,7 +163,7 @@ python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml \
- `load_parameters_path`: The path to the MaxText Orbax checkpoint (e.g., `gs://your-bucket/maxtext-checkpoint/0/items`).
- `model_name`: The corresponding model name in the MaxText configuration (e.g., `qwen3-4b`).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information.
- `use_multimodal`: Indicates whether multimodal inputs are used.
- `--run_hf_model` (Optional): Indicates whether to load the Hugging Face model from `hf_model_path`. If not set, the MaxText logits are compared against pre-saved golden logits.
- `--hf_model_path` (Optional): The path to the Hugging Face checkpoint (if `--run_hf_model=True`).
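Combining these, a verification run against a live Hugging Face model might look like the sketch below; the checkpoint path and Hugging Face model id are placeholders:

```bash
# Hypothetical example; compares MaxText logits against the HF reference model.
python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml \
  load_parameters_path=gs://your-bucket/maxtext-checkpoint/0/items \
  model_name=qwen3-4b \
  scan_layers=false \
  use_multimodal=false \
  --run_hf_model=True \
  --hf_model_path=Qwen/Qwen3-4B
```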
diff --git a/docs/guides/data_input_pipeline.md b/docs/guides/data_input_pipeline.md
index 0b65bdfa6c..bbcca2401f 100644
--- a/docs/guides/data_input_pipeline.md
+++ b/docs/guides/data_input_pipeline.md
@@ -26,6 +26,8 @@ Currently MaxText has three data input pipelines:
| **[Hugging Face](data_input_pipeline/data_input_hf.md)** | datasets in [Hugging Face Hub](https://huggingface.co/datasets)<br>local/Cloud Storage datasets in json, parquet, arrow, csv, txt (sequential access) | no download needed, convenience;<br>multiple formats | limited scalability when using the Hugging Face Hub (no limit using Cloud Storage);<br>non-deterministic with preemption<br>(deterministic without preemption) |
| **[TFDS](data_input_pipeline/data_input_tfds.md)** | TFRecord (sequential access), available through [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/overview) | performant | only supports TFRecords;<br>non-deterministic with preemption<br>(deterministic without preemption) |
+(multihost-dataloading-best-practice)=
+
## Multihost dataloading best practice
Training in a multi-host environment presents unique challenges for data input pipelines. An effective data loading strategy must address three key issues:
diff --git a/docs/guides/data_input_pipeline/data_input_grain.md b/docs/guides/data_input_pipeline/data_input_grain.md
index 5a7d66981d..33fbb09dee 100644
--- a/docs/guides/data_input_pipeline/data_input_grain.md
+++ b/docs/guides/data_input_pipeline/data_input_grain.md
@@ -1,3 +1,5 @@
+(grain-pipeline)=
+
# Grain pipeline
## The recommended input pipeline for determinism and resilience!
@@ -30,6 +32,8 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
- **Global shuffle**: This feature is only available when using Grain with the [ArrayRecord](https://github.com/google/array_record) (random access) format. It is achieved by shuffling indices globally at the beginning of each epoch and then reading the elements in that random order. This shuffle method effectively prevents local overfitting, leading to better training results.
- **Hierarchical shuffle**: For the sequential-access format [Parquet](https://arrow.apache.org/docs/python/parquet.html), shuffling is performed in these steps: file shuffling, interleaving from files, and window shuffling using a fixed-size buffer.
+(using-grain)=
+
## Using Grain
1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
diff --git a/docs/guides/model_bringup.md b/docs/guides/model_bringup.md
index fb231a49e6..40e37180e4 100644
--- a/docs/guides/model_bringup.md
+++ b/docs/guides/model_bringup.md
@@ -20,15 +20,15 @@ This documentation acts as the primary resource for efficiently integrating new
## 1. Architecture Analysis
-The first phase involves determining how the new model's architecture aligns with MaxText's existing capabilities. To facilitate this assessment, refer to the [MaxText architecture overview](https://maxtext.readthedocs.io/en/latest/reference/architecture/architecture_overview.html) and [list of supported models](https://maxtext.readthedocs.io/en/latest/reference/models/supported_models_and_architectures.html).
+The first phase involves determining how the new model's architecture aligns with MaxText's existing capabilities. To facilitate this assessment, refer to the [MaxText architecture overview](architecture-overview) and [list of supported models](supported-models).
-**Input Data Pipeline**: MaxText supports HuggingFace, Grain, and TFDS pipelines ([details](https://maxtext.readthedocs.io/en/latest/guides/data_input_pipeline.html)). While synthetic data is typically used for initial performance benchmarks, the framework supports multiple modalities including text and image (audio and video - work in progress).
+**Input Data Pipeline**: MaxText supports HuggingFace, Grain, and TFDS pipelines ([details](data-input-pipeline)). While synthetic data is typically used for initial performance benchmarks, the framework supports multiple modalities including text and image (audio and video - work in progress).
**Tokenizer**: Supported [tokenizer options](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/input_pipeline/tokenizer.py) include `TikTokenTokenizer`, `SentencePieceTokenizer`, and `HFTokenizer`.
**Self-Attention & RoPE**: Available mechanisms include optimized [Flash Attention](https://github.com/AI-Hypercomputer/maxtext/blob/62ee818144eb037ad3fe85ab8e789cd074776f46/src/maxtext/layers/attention_op.py#L1184) (supporting MHA, GQA, and MQA), Multi-head Latent Attention ([MLA](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/attention_mla.py)), and [Gated Delta Network](https://github.com/AI-Hypercomputer/maxtext/blob/62ee818144eb037ad3fe85ab8e789cd074776f46/src/maxtext/models/qwen3.py#L358). MaxText also supports [Regular](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L108), [Llama](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L178), and [YaRN](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L282) variations of Rotary Positional Embeddings (RoPE).
-**Multi-Layer Perceptron (MLP)**: The framework supports both traditional dense models and Mixture of Experts (MoE) architectures, including [configurations](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/moe_configuration.html) for routed and shared experts.
+**Multi-Layer Perceptron (MLP)**: The framework supports both traditional dense models and Mixture of Experts (MoE) architectures, including [configurations](moe-configuration) for routed and shared experts.
**Normalization**: We support different [normalization strategies](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/normalizations.py), including RMSNorm and Gated RMSNorm. These can be configured before or after attention/MLP layers.
@@ -44,7 +44,7 @@ This step can be bypassed if the current MaxText codebase already supports all c
While most open-source models are distributed in Safetensors or PyTorch formats, MaxText requires conversion to the [Orbax](https://orbax.readthedocs.io/en/latest/) format.
-There are [two primary formats](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/checkpoints.html) for Orbax checkpoints within MaxText, and while both are technically compatible with training and inference, we recommend following these performance-optimized guidelines:
+There are [two primary formats](checkpoints) for Orbax checkpoints within MaxText, and while both are technically compatible with training and inference, we recommend following these performance-optimized guidelines:
- **Scanned Format**: Recommended for **training** as it stacks layers for efficient processing via `jax.lax.scan`. To enable this, set `scan_layers=True`.
- **Unscanned Format**: Recommended for **inference** to simplify loading individual layer parameters. To enable this, set `scan_layers=False`.
@@ -58,7 +58,7 @@ Success starts with a clear map. You must align the parameter names from your so
### 3.2 Write Script
-Use existing model scripts within the repository as templates to tailor the conversion logic for your specific architecture. We strongly recommended to use the [checkpoint conversion utility](https://maxtext.readthedocs.io/en/latest/guides/checkpointing_solutions/convert_checkpoint.html) rather than [standalone scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion/standalone_scripts).
+Use existing model scripts within the repository as templates to tailor the conversion logic for your specific architecture. We strongly recommend using the [checkpoint conversion utility](checkpoint-conversion) rather than [standalone scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion/standalone_scripts).
### 3.3 Verify Compatibility
@@ -132,7 +132,7 @@ If you run the `forward_pass_logit_checker.py` to compare reference logits with
**Q: How do I compile models for target hardware without physical access?**
-**A:** If you need to compile your training run ahead of time, use the train_compile.py tool. This utility allows you to compile the primary train_step for specific target hardware without needing the actual devices on hand. It’s particularly useful for verifying your implementation's functionality on a local Cloud VM or a standard CPU. Please refer [here](https://maxtext.readthedocs.io/en/latest/guides/monitoring_and_debugging/features_and_diagnostics.html#ahead-of-time-compilation-aot) for more examples.
+**A:** If you need to compile your training run ahead of time, use the `train_compile.py` tool. This utility allows you to compile the primary `train_step` for specific target hardware without needing the actual devices on hand. It’s particularly useful for verifying your implementation's functionality on a local Cloud VM or a standard CPU. Please refer [here](aot-compilation) for more examples.
**Q: My model is too large for my development machine. What should I do?**
diff --git a/docs/guides/optimization/custom_model.md b/docs/guides/optimization/custom_model.md
index 991c322a99..45f583b40b 100644
--- a/docs/guides/optimization/custom_model.md
+++ b/docs/guides/optimization/custom_model.md
@@ -254,7 +254,7 @@ Ironwood over ICI:
- `3 * M * 8 / 2 > 12800`
- `M > 1100` (since `3 * M * 8 / 2 = 12 * M`, the bound is `M > 1067`, rounded up here to ~1100)
-It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/optimization/sharding.html) for specific challenges regarding PP + FSDP/DP.
+It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](sharding_on_TPUs) for specific challenges regarding PP + FSDP/DP.
## Step 4. Analyze experiments
diff --git a/docs/guides/run_python_notebook.md b/docs/guides/run_python_notebook.md
index 6e6cc08091..26afffb3b6 100644
--- a/docs/guides/run_python_notebook.md
+++ b/docs/guides/run_python_notebook.md
@@ -86,7 +86,7 @@ To install, click the `Extensions` icon on the left sidebar (or press `Ctrl+Shif
### Step 3: Install MaxText and Dependencies
-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guide](install-from-source), specifically `Option 3: Installing [tpu-post-train]`. This ensures all post-training dependencies are installed inside your virtual environment.
> **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.
@@ -139,7 +139,7 @@ pip3 install jupyterlab
### Step 3: Install MaxText and Dependencies
-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guide](install-from-source), specifically `Option 3: Installing [tpu-post-train]`. This ensures all post-training dependencies are installed inside your virtual environment.
> **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.
diff --git a/docs/install_maxtext.md b/docs/install_maxtext.md
index 47d31e93cb..a3532d3a82 100644
--- a/docs/install_maxtext.md
+++ b/docs/install_maxtext.md
@@ -74,7 +74,7 @@ This is the easiest way to get started with the latest stable version.
access to the `build_maxtext_docker_image`, `upload_maxtext_docker_image`,
and `xpk` commands. For more details on building and uploading Docker
images, see the
- [Build MaxText Docker Image](https://maxtext.readthedocs.io/en/latest/build_maxtext.html)
+ [Build MaxText Docker Image](./build_maxtext)
guide.
```bash
diff --git a/docs/reference/architecture/architecture_overview.md b/docs/reference/architecture/architecture_overview.md
index 1b73145dcd..e59e09fc43 100644
--- a/docs/reference/architecture/architecture_overview.md
+++ b/docs/reference/architecture/architecture_overview.md
@@ -1,3 +1,5 @@
+(architecture-overview)=
+
# Architecture overview
## The MaxText philosophy
diff --git a/docs/reference/architecture/jax_ai_libraries_chosen.md b/docs/reference/architecture/jax_ai_libraries_chosen.md
index 4dac03eb44..9d29866a2b 100644
--- a/docs/reference/architecture/jax_ai_libraries_chosen.md
+++ b/docs/reference/architecture/jax_ai_libraries_chosen.md
@@ -56,11 +56,11 @@ For more information on using Orbax, please refer to https://github.com/google/o
1. **Deterministic by Design**: Grain stores the data loader's state and provides strong guarantees about data ordering and sharding, even with preemptions, which is critical for reproducibility.
2. **Global Shuffle**: Prevents local overfitting.
-3. **Built for Multi-Host Training**: The using random access file format streamlines [data loading in the multi-host environments](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/data_input_pipeline.html#multihost-dataloading-best-practice).
+3. **Built for Multi-Host Training**: Using a random-access file format streamlines [data loading in multi-host environments](multihost-dataloading-best-practice).
Its APIs are explicitly designed for the multi-host paradigm, simplifying the process of ensuring that each host loads a unique shard of the global batch.
-For more information on using Grain, please refer to https://github.com/google/grain and the grain guide in maxtext located at https://maxtext.readthedocs.io/en/latest/guides/data_input_pipeline/data_input_grain.html
+For more information on using Grain, please refer to https://github.com/google/grain and the [Grain guide in MaxText](grain-pipeline).
## Qwix: For native JAX quantization
diff --git a/docs/reference/core_concepts/batch_size.md b/docs/reference/core_concepts/batch_size.md
index 134a495c86..e74cdad8e7 100644
--- a/docs/reference/core_concepts/batch_size.md
+++ b/docs/reference/core_concepts/batch_size.md
@@ -34,11 +34,11 @@ You can set `per_device_batch_size` and `gradient_accumulation_steps` in `config
`global_batch_to_load = global_batch_size_to_train_on * expansion_factor_real_data`
-When `expansion_factor_real_data > 1`, only a subset of hosts read data from the source (e.g., a GCS bucket). These "loading hosts" read more data than they need for their own devices and distribute the surplus to other "non-loading" hosts. This reduces the number of concurrent connections to the data source, which can significantly improve I/O throughput. When set to between 0 and 1, it's for grain pipeline to use a smaller chip count to read checkpoint from a larger chip count job. Details in https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/data_input_pipeline/data_input_grain.html#using-grain.
+When `expansion_factor_real_data > 1`, only a subset of hosts read data from the source (e.g., a GCS bucket). These "loading hosts" read more data than they need for their own devices and distribute the surplus to other "non-loading" hosts. This reduces the number of concurrent connections to the data source, which can significantly improve I/O throughput. When set to a value between 0 and 1, it allows the Grain pipeline to read a checkpoint written by a job with a larger chip count using a smaller chip count. Details in [](using-grain).
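As a quick, illustrative calculation (shell arithmetic only; the variable names simply mirror the config keys above):

```bash
# With expansion_factor_real_data=2, loading hosts fetch twice the training
# batch and redistribute the surplus to non-loading hosts.
GLOBAL_BATCH_SIZE_TO_TRAIN_ON=1024
EXPANSION_FACTOR_REAL_DATA=2
echo $(( GLOBAL_BATCH_SIZE_TO_TRAIN_ON * EXPANSION_FACTOR_REAL_DATA ))  # global_batch_to_load = 2048
```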
## Gradient Accumulation Steps
-`gradient_accumulation_steps` defines how many forward/backward passes are performed before the optimizer updates the model weights. The gradients from each pass are accumulated (summed). It is discussed in more detail [here](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/tiling.html#gradient-accumulation).
+`gradient_accumulation_steps` defines how many forward/backward passes are performed before the optimizer updates the model weights. The gradients from each pass are accumulated (summed). It is discussed in more detail [here](./tiling.md#gradient-accumulation).
For example, if `gradient_accumulation_steps` is set to `4`, the model will execute four forward and backward passes, sum the gradients, and then apply a single optimizer step. This achieves the same effective global batch size as quadrupling the `per_device_batch_size` with significantly less memory, but can potentially lead to lower MFU.
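To make the equivalence concrete, here is a small illustrative calculation; the names mirror the config keys, and the effective batch is assumed to be per-device batch times device count times accumulation steps:

```bash
# Four accumulation steps give the same effective global batch size as
# quadrupling per_device_batch_size, at a fraction of the memory.
PER_DEVICE_BATCH_SIZE=4
NUM_DEVICES=256
GRADIENT_ACCUMULATION_STEPS=4
echo $(( PER_DEVICE_BATCH_SIZE * NUM_DEVICES * GRADIENT_ACCUMULATION_STEPS ))  # 4096
```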
diff --git a/docs/reference/core_concepts/checkpoints.md b/docs/reference/core_concepts/checkpoints.md
index 3e6b220220..779609befb 100644
--- a/docs/reference/core_concepts/checkpoints.md
+++ b/docs/reference/core_concepts/checkpoints.md
@@ -14,6 +14,8 @@
limitations under the License.
-->
+(checkpoints)=
+
# Checkpoints
## Checkpoint formats
diff --git a/docs/reference/core_concepts/moe_configuration.md b/docs/reference/core_concepts/moe_configuration.md
index 96b3bbe65e..3f7cb75d6e 100644
--- a/docs/reference/core_concepts/moe_configuration.md
+++ b/docs/reference/core_concepts/moe_configuration.md
@@ -14,6 +14,8 @@
limitations under the License.
-->
+(moe-configuration)=
+
# Mixture of Experts (MoE) Configuration
This document provides a detailed explanation of the configuration parameters related to Mixture of Experts (MoE) models in MaxText. These settings control the model architecture, routing mechanisms, and performance optimizations. Default values and parameter definitions are located in `src/maxtext/configs/base.yml` and are primarily used in `src/maxtext/layers/moe.py`.
diff --git a/docs/reference/core_concepts/tiling.md b/docs/reference/core_concepts/tiling.md
index 90669668da..9a92d3c2d3 100644
--- a/docs/reference/core_concepts/tiling.md
+++ b/docs/reference/core_concepts/tiling.md
@@ -80,4 +80,4 @@ Tiling is also crucial for managing data movement across the memory hierarchy (H
**Tiling** and **sharding** are independent concepts that do not conflict; in fact, they are often used together. Sharding distributes a tensor across multiple devices, while tiling processes a tensor in chunks on the same device.
-To learn more about sharding in MaxText, please refer to the [sharding documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/optimization/sharding.html).
+To learn more about sharding in MaxText, please refer to the [sharding documentation](sharding_on_TPUs).
diff --git a/docs/reference/models/supported_models_and_architectures.md b/docs/reference/models/supported_models_and_architectures.md
index fb80002df5..a01c0516be 100644
--- a/docs/reference/models/supported_models_and_architectures.md
+++ b/docs/reference/models/supported_models_and_architectures.md
@@ -1,3 +1,5 @@
+(supported-models)=
+
# Supported models list
> **Purpose**: This page provides detailed, reference-style information about model families supported in MaxText. This page is a technical dictionary for quick lookup, reproducibility, and customization.
@@ -10,12 +12,14 @@ MaxText is an open-source, high-performance LLM framework written in Python/JAX.
- **Supported Precisions**: FP32, BF16, INT8, and FP8.
- **Ahead-of-Time Compilation (AOT)**: For faster model development/prototyping and earlier OOM detection.
-- **Quantization**: Via **Qwix** (recommended) and AQT. See Quantization [Guide](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/quantization.html).
+- **Quantization**: Via **Qwix** (recommended) and AQT. See the [Quantization guide](quantization-doc).
- **Diagnostics**: Structured error context via **`cloud_tpu_diagnostics`** (filters stack traces to user code), simple logging via `max_logging`, profiling in **XProf**, and visualization in **TensorBoard**.
- **Multi-Token Prediction (MTP)**: Enables token-efficient training with multi-token prediction.
- **Elastic Training**: Fault-tolerant and dynamic scale-up/scale-down on Cloud TPUs with Pathways.
- **Flexible Remat Policy**: Provides fine-grained control over memory-compute trade-offs. Users can select pre-defined policies (like 'full' or 'minimal') or set the policy to **'custom'**.
+(supported-model-families)=
+
## Supported model families
> _**Note on GPU Coverage**: Support and tested configurations for NVIDIA GPUs can vary by model family. Please see the specific model guides for details._
diff --git a/docs/reference/models/tiering.md b/docs/reference/models/tiering.md
index ba83b30271..502ad01810 100644
--- a/docs/reference/models/tiering.md
+++ b/docs/reference/models/tiering.md
@@ -40,4 +40,4 @@ For each of the TPU platforms listed below, we present a list of optimized model
\[1\]: Performance results are subject to variations based on system configuration, software versions, and other factors. These benchmarks represent point-in-time measurements under specific conditions.
-\[2\]: Some older TFLOPS/s results are impacted by an updated calculation for causal attention ([PR #1988](https://github.com/AI-Hypercomputer/maxtext/pull/1988)), which halves the attention FLOPs. This change particularly affects configurations with large sequence lengths. For more details, please refer to the [performance metrics guide](https://maxtext.readthedocs.io/en/latest/reference/performance_metrics.html).
+\[2\]: Some older TFLOPS/s results are impacted by an updated calculation for causal attention ([PR #1988](https://github.com/AI-Hypercomputer/maxtext/pull/1988)), which halves the attention FLOPs. This change particularly affects configurations with large sequence lengths. For more details, please refer to the [performance metrics guide](performance-metrics).
diff --git a/docs/run_maxtext/run_maxtext_localhost.md b/docs/run_maxtext/run_maxtext_localhost.md
index 843c52a5f3..bb485e41a5 100644
--- a/docs/run_maxtext/run_maxtext_localhost.md
+++ b/docs/run_maxtext/run_maxtext_localhost.md
@@ -36,7 +36,7 @@ Local development on a single host TPU/GPU VM is a convenient way to run MaxText
1. Create and SSH to the single host VM of your choice. You can use any available single host TPU, such as `v5litepod-8`, `v5p-8`, or `v4-8`. For GPUs, you can use `nvidia-h100-mega-80gb`, `nvidia-h200-141gb`, or `nvidia-b200`. For setting up a TPU VM, use the Cloud TPU documentation available at https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm. For a GPU setup, refer to the guide at https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus.
-2. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html).
+2. For instructions on installing MaxText on your VM, please refer to the [official documentation](../../install_maxtext).
#### Run a Test Training Job
diff --git a/docs/run_maxtext/run_maxtext_single_host_gpu.md b/docs/run_maxtext/run_maxtext_single_host_gpu.md
index 1aa450f4d5..1077dccd6c 100644
--- a/docs/run_maxtext/run_maxtext_single_host_gpu.md
+++ b/docs/run_maxtext/run_maxtext_single_host_gpu.md
@@ -62,7 +62,7 @@ https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-e
## Build MaxText Docker image
-For instructions on building the MaxText Docker image, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).
+For instructions on building the MaxText Docker image, please refer to the [official documentation](../../build_maxtext).
## Test
diff --git a/docs/run_maxtext/run_maxtext_via_pathways.md b/docs/run_maxtext/run_maxtext_via_pathways.md
index c1c16920ae..4879a31d8a 100644
--- a/docs/run_maxtext/run_maxtext_via_pathways.md
+++ b/docs/run_maxtext/run_maxtext_via_pathways.md
@@ -35,7 +35,7 @@ Before you can run a MaxText workload, you must complete the following setup ste
2. **Create a GKE cluster** configured for Pathways.
-3. **Build and upload a MaxText Docker image** to your project's Artifact Registry. For instructions on building and uploading the MaxText Docker image, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).
+3. **Build and upload a MaxText Docker image** to your project's Artifact Registry. For instructions on building and uploading the MaxText Docker image, please refer to the [official documentation](../../build_maxtext).
## 2. Environment configuration
diff --git a/docs/run_maxtext/run_maxtext_via_xpk.md b/docs/run_maxtext/run_maxtext_via_xpk.md
index c800366aee..4fcc490b89 100644
--- a/docs/run_maxtext/run_maxtext_via_xpk.md
+++ b/docs/run_maxtext/run_maxtext_via_xpk.md
@@ -101,7 +101,7 @@ ______________________________________________________________________
## 3. Build the MaxText Docker image
-For instructions on building the MaxText Docker image, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).
+For instructions on building the MaxText Docker image, please refer to the [official documentation](../../build_maxtext).
______________________________________________________________________
diff --git a/docs/tutorials/first_run.md b/docs/tutorials/first_run.md
index f04c6acb9c..6d2ff2d6c8 100644
--- a/docs/tutorials/first_run.md
+++ b/docs/tutorials/first_run.md
@@ -36,7 +36,7 @@ Local development is a convenient way to run MaxText on a single host. It doesn'
multiple hosts but is a good way to learn about MaxText.
1. [Create and SSH to the single host VM of your choice](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm). You can use any available single host TPU, such as `v5litepod-8`, `v5p-8`, or `v4-8`.
-2. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). For this tutorial on TPUs, install `maxtext[tpu]`.
+2. For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext). For this tutorial on TPUs, install `maxtext[tpu]`.
3. After installation completes, run training on synthetic data with the following command:
```sh
@@ -70,7 +70,7 @@ You can use [demo_decoding.ipynb](https://github.com/AI-Hypercomputer/maxtext/bl
### Run MaxText on NVIDIA GPUs
-1. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). For this tutorial on GPUs, install `maxtext[cuda12]`.
+1. For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext). For this tutorial on GPUs, install `maxtext[cuda12]`.
2. After installation is complete, run training with the following command on synthetic data:
```sh
@@ -102,4 +102,4 @@ Google Kubernetes Engine (GKE) is the recommended way to run MaxText on multiple
## Next steps: preflight optimizations
-After you get workloads running, there are optimizations you can apply to improve performance. For more information, see [PREFLIGHT.md](https://github.com/google/maxtext/blob/main/PREFLIGHT.md).
+After you get workloads running, there are optimizations you can apply to improve performance. For more information, see [PREFLIGHT.md](https://github.com/AI-Hypercomputer/maxtext/blob/main/PREFLIGHT.md).
diff --git a/docs/tutorials/inference.md b/docs/tutorials/inference.md
index 049cdd83be..89a366c051 100644
--- a/docs/tutorials/inference.md
+++ b/docs/tutorials/inference.md
@@ -25,7 +25,7 @@ We support inference of MaxText models on vLLM via an [out-of-tree](https://gith
# Installation
-Follow the instructions in [install maxtext](https://maxtext.readthedocs.io/en/latest/install_maxtext.html) to install MaxText. For this inference tutorial on TPU (which uses vLLM), you must install `maxtext[tpu-post-train]`, as it includes the required adapter plugin. We recommend installing from PyPI to ensure you have the latest stable version of dependencies.
+Follow the instructions in the [MaxText installation guide](../install_maxtext) to install MaxText. For this inference tutorial on TPU (which uses vLLM), you must install `maxtext[tpu-post-train]`, as it includes the required adapter plugin. We recommend installing from PyPI to ensure you have the latest stable version of dependencies.
After finishing the installation, ensure that the MaxText on vLLM adapter plugin has been installed. To do so, run the following command:
@@ -55,7 +55,7 @@ install_tpu_post_train_extra_deps
We include a script for convenient offline inference of MaxText models in `src/maxtext/inference/vllm_decode.py`, which is helpful for verifying the correctness of MaxText checkpoints. The script invokes the [`LLM`](https://docs.vllm.ai/en/latest/serving/offline_inference/#offline-inference) API from vLLM.
> **_NOTE:_**
-> You will need to convert a checkpoint from HuggingFace in order to run the command. Do so first by following the steps in the [convert checkpoint](https://maxtext.readthedocs.io/en/latest/guides/checkpointing_solutions/convert_checkpoint.html) tutorial.
+> You will need to convert a checkpoint from HuggingFace in order to run the command. Do so first by following the steps in the [convert checkpoint](checkpoint-conversion) tutorial.
> **_NOTE:_**
> The remainder of this tutorial assumes that the path to the converted MaxText checkpoint is stored in \$CHECKPOINT_PATH.
@@ -125,12 +125,12 @@ curl http://localhost:8000/v1/completions \
# Reinforcement Learning (RL)
> **_NOTE:_**
-> Please refer to the [reinforcement learning tutorial](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl.html) to get started with reinforcement learning on MaxText.
+> Please refer to the [reinforcement learning tutorial](./posttraining/rl) to get started with reinforcement learning on MaxText.
> **_NOTE:_**
> You will need a HuggingFace token to run this command in addition to a MaxText model checkpoint. Please see the following [guide](https://huggingface.co/docs/hub/en/security-tokens) to generate one.
-To use a MaxText model architecture for samplers in reinforcement learning algorithms like GRPO, we can override the vLLM model architecture and pass in MaxText specific config arguments similar to the [online inference](https://maxtext.readthedocs.io/en/latest/tutorials/inference.html#online-inference) use-case. An example of an RL command using the MaxText model for samplers can be found below:
+To use a MaxText model architecture for samplers in reinforcement learning algorithms like GRPO, we can override the vLLM model architecture and pass in MaxText-specific config arguments, similar to the [online inference](./inference.md#online-inference) use case. An example of an RL command using the MaxText model for samplers can be found below:
```bash
python3 -m src.maxtext.trainers.post_train.rl.train_rl \
diff --git a/docs/tutorials/post_training_index.md b/docs/tutorials/post_training_index.md
index 5500f60808..c8f86280a3 100644
--- a/docs/tutorials/post_training_index.md
+++ b/docs/tutorials/post_training_index.md
@@ -14,7 +14,7 @@ We’re investing in performance, scale, algorithms, models, reliability, and ea
MaxText was co-designed with key Google-led innovations to provide a unified post-training experience:
-- [MaxText model library](https://maxtext.readthedocs.io/en/latest/reference/models/supported_models_and_architectures.html#supported-model-families) for JAX LLMs highly optimized for TPUs
+- [MaxText model library](supported-model-families) for JAX LLMs highly optimized for TPUs
- [Tunix](https://github.com/google/tunix) for the latest algorithms and post-training techniques
- [vLLM on TPU](https://github.com/vllm-project/tpu-inference) for high performance sampling (inference) for Reinforcement Learning (RL)
- [Pathways](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro) for multi-host inference (sampling) and highly efficient weight transfer
@@ -24,13 +24,13 @@ MaxText was co-designed with key Google led innovations to provide a unified pos
## Supported techniques & models
- **SFT (Supervised Fine-Tuning)**
- - [SFT on Single-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft.html)
- - [SFT on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft_on_multi_host.html)
+ - [SFT on Single-Host TPUs](./posttraining/sft)
+ - [SFT on Multi-Host TPUs](./posttraining/sft_on_multi_host)
- **Multimodal SFT**
- - [Multimodal Support](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/multimodal.html)
+ - [Multimodal Support](./posttraining/multimodal)
- **Reinforcement Learning (RL)**
- - [RL on Single-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl.html)
- - [RL on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html)
+ - [RL on Single-Host TPUs](./posttraining/rl)
+ - [RL on Multi-Host TPUs](./posttraining/rl_on_multi_host)
## Step by step RL
@@ -55,7 +55,7 @@ Pathways supercharges RL with:
## Getting started
-Start your Post-Training journey through quick experimentation with [Python Notebooks](https://maxtext.readthedocs.io/en/latest/guides/run_python_notebook.html) or our Production level tutorials for [SFT](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft_on_multi_host.html) and [RL](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html).
+Start your post-training journey with quick experimentation in [Python Notebooks](../guides/run_python_notebook), or use our production-level tutorials for [SFT](./posttraining/sft_on_multi_host) and [RL](./posttraining/rl_on_multi_host).
## More tutorials
diff --git a/docs/tutorials/posttraining/full_finetuning.md b/docs/tutorials/posttraining/full_finetuning.md
index 47a7cd3d1e..4602374d3d 100644
--- a/docs/tutorials/posttraining/full_finetuning.md
+++ b/docs/tutorials/posttraining/full_finetuning.md
@@ -24,7 +24,7 @@ In this tutorial we use a single host TPU VM such as `v6e-8/v5p-8`. Let's get st
## Install dependencies
-For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html) and use the `maxtext[tpu]` installation path to include all necessary dependencies.
+For instructions on installing MaxText on your VM, please refer to the [official documentation](../../install_maxtext) and use the `maxtext[tpu]` installation path to include all necessary dependencies.
## Setup environment variables
@@ -70,7 +70,7 @@ export MAXTEXT_CKPT_PATH= # e.g., gs://my-bucket/my-model-checkpoint/
### Option 2: Converting a Hugging Face checkpoint
-Refer the steps in [Hugging Face to MaxText](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/checkpointing_solutions/convert_checkpoint.html#hugging-face-to-maxtext) to convert a hugging face checkpoint to MaxText. Make sure you have correct checkpoint files converted and saved. Similar as Option 1, you can set the following environment and move on.
+Refer to the steps in [Hugging Face to MaxText](hf-to-maxtext) to convert a Hugging Face checkpoint to MaxText format. Make sure the converted checkpoint files are saved correctly. As in Option 1, you can set the following environment variable and move on.
```bash
export MAXTEXT_CKPT_PATH= # gs://my-bucket/my-checkpoint-directory/0/items
diff --git a/docs/tutorials/posttraining/knowledge_distillation.md b/docs/tutorials/posttraining/knowledge_distillation.md
index ea6222a09c..bf26784bad 100644
--- a/docs/tutorials/posttraining/knowledge_distillation.md
+++ b/docs/tutorials/posttraining/knowledge_distillation.md
@@ -49,7 +49,7 @@ export RUN_NAME= # e.g., distill-20260115
To install MaxText and its dependencies for post-training (including vLLM for the teacher), run the following:
-1. Follow the [MaxText installation instructions](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#install-maxtext).
+1. Follow the [MaxText installation instructions](../../install_maxtext).
2. Install the additional dependencies for post-training:
diff --git a/docs/tutorials/posttraining/rl.md b/docs/tutorials/posttraining/rl.md
index 060a2c44cd..e95fc74d56 100644
--- a/docs/tutorials/posttraining/rl.md
+++ b/docs/tutorials/posttraining/rl.md
@@ -44,7 +44,7 @@ Let's get started!
## Install MaxText and post-training dependencies
-For instructions on installing MaxText with post-training dependencies on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html) and use the `maxtext[tpu-post-train]` installation path to include all necessary post-training dependencies.
+For instructions on installing MaxText with post-training dependencies on your VM, please refer to the [official documentation](../../install_maxtext) and use the `maxtext[tpu-post-train]` installation path to include all necessary post-training dependencies.
> **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.
@@ -98,7 +98,7 @@ export MAXTEXT_CKPT_PATH= # e.g., gs://my-bucket/my-model-checkpoint/
### Option 2: Converting from a Hugging Face checkpoint
-Refer the steps in [Hugging Face to MaxText](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/checkpointing_solutions/convert_checkpoint.html#hugging-face-to-maxtext) to convert a hugging face checkpoint to MaxText. Make sure you have correct checkpoint files converted and saved. Similar as Option 1, you can set the following environment and move on.
+Refer to the steps in [Hugging Face to MaxText](hf-to-maxtext) to convert a Hugging Face checkpoint to MaxText format. Make sure the converted checkpoint files are saved correctly. As in Option 1, you can set the following environment variable and move on.
```bash
export MAXTEXT_CKPT_PATH= # e.g., gs://my-bucket/my-model-checkpoint/0/items
diff --git a/docs/tutorials/posttraining/rl_on_multi_host.md b/docs/tutorials/posttraining/rl_on_multi_host.md
index 4cae59bebd..2602054214 100644
--- a/docs/tutorials/posttraining/rl_on_multi_host.md
+++ b/docs/tutorials/posttraining/rl_on_multi_host.md
@@ -64,7 +64,7 @@ Before starting, ensure you have:
## Build and upload MaxText Docker image
-For instructions on building and uploading the MaxText Docker image with post-training dependencies, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).
+For instructions on building and uploading the MaxText Docker image with post-training dependencies, please refer to the [official documentation](../../build_maxtext).
## Setup Environment Variables
diff --git a/docs/tutorials/posttraining/sft.md b/docs/tutorials/posttraining/sft.md
index 12465dc5ff..a9602be320 100644
--- a/docs/tutorials/posttraining/sft.md
+++ b/docs/tutorials/posttraining/sft.md
@@ -26,7 +26,7 @@ In this tutorial we use a single host TPU VM such as `v6e-8/v5p-8`. Let's get st
## Install MaxText and Post-Training dependencies
-For instructions on installing MaxText with post-training dependencies on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html) and use the `maxtext[tpu-post-train]` installation path to include all necessary post-training dependencies.
+For instructions on installing MaxText with post-training dependencies on your VM, please refer to the [official documentation](../../install_maxtext) and use the `maxtext[tpu-post-train]` installation path to include all necessary post-training dependencies.
> **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.
@@ -82,7 +82,7 @@ export MAXTEXT_CKPT_PATH= # e.g., gs://my-bucket/my-model-checkpoint/
### Option 2: Converting a Hugging Face checkpoint
-Refer the steps in [Hugging Face to MaxText](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/checkpointing_solutions/convert_checkpoint.html#hugging-face-to-maxtext) to convert a hugging face checkpoint to MaxText. Make sure you have correct checkpoint files converted and saved. Similar as Option 1, you can set the following environment and move on.
+Refer to the steps in [Hugging Face to MaxText](hf-to-maxtext) to convert a Hugging Face checkpoint to MaxText format. Make sure the converted checkpoint files are saved correctly. As in Option 1, you can set the following environment variable and move on.
```sh
export MAXTEXT_CKPT_PATH= # e.g., gs://my-bucket/my-model-checkpoint/0/items
diff --git a/docs/tutorials/posttraining/sft_on_multi_host.md b/docs/tutorials/posttraining/sft_on_multi_host.md
index 4f899dfdad..c44d24cd64 100644
--- a/docs/tutorials/posttraining/sft_on_multi_host.md
+++ b/docs/tutorials/posttraining/sft_on_multi_host.md
@@ -37,7 +37,7 @@ Before starting, ensure you have:
## Build and upload MaxText Docker image
-For instructions on building and uploading the MaxText Docker image with post-training dependencies, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).
+For instructions on building and uploading the MaxText Docker image with post-training dependencies, please refer to the [official documentation](../../build_maxtext).
## Create GKE cluster
@@ -129,7 +129,7 @@ checkpoint_storage_use_ocdbt=$((1 - USE_PATHWAYS))
### Option 2: Converting a Hugging Face checkpoint
-Refer the steps in [Hugging Face to MaxText](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/checkpointing_solutions/convert_checkpoint.html#hugging-face-to-maxtext) to convert a hugging face checkpoint to MaxText. Make sure you have correct checkpoint files converted and saved. Similar as Option 1, you can set the following environment and move on.
+Refer to the steps in [Hugging Face to MaxText](hf-to-maxtext) to convert a Hugging Face checkpoint to MaxText format. Make sure the converted checkpoint files are saved correctly. As in Option 1, you can set the following environment variable and move on.
```bash
export MAXTEXT_CKPT_PATH= # gs://my-bucket/my-checkpoint-directory/0/items