diff --git a/mkdocs.yml b/mkdocs.yml
index 5de85a37d..debf86167 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -171,12 +171,14 @@ plugins:
"examples/distributed-training/trl.md": "docs/examples/training/trl.md"
"examples/distributed-training/axolotl.md": "docs/examples/training/axolotl.md"
"examples/distributed-training/ray-ragen.md": "docs/examples/training/ray-ragen.md"
+ "examples/distributed-training/miles.md": "docs/examples/training/miles.md"
# Single-node and distributed training merged under Training
"docs/examples/single-node-training/trl.md": "docs/examples/training/trl.md"
"docs/examples/single-node-training/axolotl.md": "docs/examples/training/axolotl.md"
"docs/examples/distributed-training/trl.md": "docs/examples/training/trl.md"
"docs/examples/distributed-training/axolotl.md": "docs/examples/training/axolotl.md"
"docs/examples/distributed-training/ray-ragen.md": "docs/examples/training/ray-ragen.md"
+ "docs/examples/distributed-training/miles.md": "docs/examples/training/miles.md"
"examples/clusters/aws.md": "docs/examples/clusters/aws.md"
"examples/clusters/gcp.md": "docs/examples/clusters/gcp.md"
"examples/clusters/lambda.md": "docs/examples/clusters/lambda.md"
@@ -312,6 +314,7 @@ nav:
- TRL: docs/examples/training/trl.md
- Axolotl: docs/examples/training/axolotl.md
- Ray+RAGEN: docs/examples/training/ray-ragen.md
+ - Miles: docs/examples/training/miles.md
- Clusters:
- AWS: docs/examples/clusters/aws.md
- GCP: docs/examples/clusters/gcp.md
diff --git a/mkdocs/docs/concepts/tasks.md b/mkdocs/docs/concepts/tasks.md
index 43eb8e80c..d3a3515b8 100644
--- a/mkdocs/docs/concepts/tasks.md
+++ b/mkdocs/docs/concepts/tasks.md
@@ -151,7 +151,7 @@ Jobs on each node communicate using their private IP addresses. Use `DSTACK_MAST
`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks.
!!! info "Examples"
- See the training examples for [TRL](../examples/training/trl.md#distributed-training), [Axolotl](../examples/training/axolotl.md#distributed-training), and [Ray+RAGEN](../examples/training/ray-ragen.md).
+ See the training examples for [TRL](../examples/training/trl.md#distributed-training), [Axolotl](../examples/training/axolotl.md#distributed-training), [Ray+RAGEN](../examples/training/ray-ragen.md), and [Miles](../examples/training/miles.md).
See the cluster examples for [AWS](../examples/clusters/aws.md), [GCP](../examples/clusters/gcp.md), [Lambda](../examples/clusters/lambda.md), [Crusoe](../examples/clusters/crusoe.md), [Nebius](../examples/clusters/nebius.md), and [NCCL/RCCL tests](../examples/clusters/nccl-rccl-tests.md).
diff --git a/mkdocs/docs/examples.md b/mkdocs/docs/examples.md
index 59203cdf8..ad8bf4839 100644
--- a/mkdocs/docs/examples.md
+++ b/mkdocs/docs/examples.md
@@ -49,6 +49,17 @@ hide:
Fine-tune an agent on multiple nodes with RAGEN, verl, and Ray.
+
+
+
+ Miles
+
+
+
+ RL-fine-tune Qwen2.5-32B with Miles.
+
+
diff --git a/mkdocs/docs/examples/training/miles.md b/mkdocs/docs/examples/training/miles.md
new file mode 100644
index 000000000..59451d150
--- /dev/null
+++ b/mkdocs/docs/examples/training/miles.md
@@ -0,0 +1,283 @@
+---
+title: Miles
+description: RL post-training Qwen2.5-32B with Miles, SGLang, Megatron-LM, and Ray across two 8xH100 nodes
+---
+
+# Miles
+
+This example shows how to use `dstack` and [Miles](https://github.com/radixark/miles)
+for reinforcement learning (RL) post-training of a 32B language model with
+[GRPO](https://arxiv.org/abs/2402.03300) across a multi-node cluster.
+Miles integrates [SGLang](https://github.com/sgl-project/sglang) for
+high-throughput rollouts, [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
+for training, and [Ray](https://docs.ray.io/en/latest/) to coordinate the
+trainer and rollout actors across nodes.
+
+Here we fine-tune `Qwen/Qwen2.5-32B-Instruct` on the
+[GSM8K](https://huggingface.co/datasets/openai/gsm8k) dataset.
+
+!!! info "Prerequisites"
+ Multi-node tasks require a [fleet](../../concepts/fleets.md) with
+ `placement` set to [`cluster`](../../concepts/fleets.md#cluster-placement).
+
+## Run a Ray cluster
+
+### Define a configuration
+
+The [task](../../concepts/tasks.md) below starts Ray on two nodes and prepares
+each node by downloading the model and dataset, then converting the checkpoint
+to Megatron's `torch_dist` format.
+
+
+
+```yaml
+type: task
+name: miles-qwen32b-h100
+nodes: 2
+image: radixark/miles:sglang-miles-v0.5.12
+env:
+ - WANDB_API_KEY
+ - PYTHONPATH=/root/Megatron-LM
+ - NCCL_DEBUG=INFO
+ - MODEL_ID=Qwen/Qwen2.5-32B-Instruct
+commands:
+ # 1. Download the model and dataset.
+ - pip install -U "huggingface_hub[cli]"
+ - hf download "$MODEL_ID" --local-dir "/root/$(basename "$MODEL_ID")"
+ - hf download --repo-type dataset openai/gsm8k --local-dir /root/gsm8k
+ # 2. Convert the Hugging Face checkpoint to Megatron torch_dist.
+ - |
+ MODEL_NAME="$(basename "$MODEL_ID")"
+ cd /root/miles && python tools/convert_hf_to_torch_dist.py \
+ --swiglu \
+ --num-layers 64 \
+ --hidden-size 5120 \
+ --ffn-hidden-size 27648 \
+ --num-attention-heads 40 \
+ --use-rotary-position-embeddings \
+ --disable-bias-linear \
+ --add-qkv-bias \
+ --normalization RMSNorm \
+ --norm-epsilon 1e-5 \
+ --rotary-base 1000000 \
+ --group-query-attention \
+ --num-query-groups 8 \
+ --vocab-size 152064 \
+ --untie-embeddings-and-output-weights \
+ --hf-checkpoint "/root/$MODEL_NAME" \
+ --save "/root/${MODEL_NAME}_torch_dist"
+ # 3. Start Ray.
+ - |
+ if [ $DSTACK_NODE_RANK = 0 ]; then
+ ray start --head --port=6379
+ else
+ ray start --address=$DSTACK_MASTER_NODE_IP:6379
+ fi
+ports:
+ - 8265
+resources:
+ gpu: H100:8
+ shm_size: 32GB
+ disk: 1000GB..
+volumes:
+ - /checkpoints:/checkpoints
+```
+
+
+
+### Run the configuration
+
+Run the task with [`dstack apply`](../../reference/cli/dstack/apply.md). By
+default, `dstack apply` forwards the Ray dashboard port to `localhost:8265`.
+
+
+
+```shell
+$ export WANDB_API_KEY=...
+$ dstack apply -f miles-qwen32b-h100.dstack.yml
+```
+
+
+
+While `dstack apply` is attached, you can submit Ray jobs through
+`localhost:8265`. If you detach or run from another machine, use
+[`dstack attach`](../../reference/cli/dstack/attach.md) to re-attach and make
+the dashboard port accessible on `localhost`.
+
+> To run on a single node, remove `nodes` or set it to `1`, then submit the job
+> with `NUM_NODES=1`. In this case, `placement: cluster` is not required.
+
+## Submit Ray jobs
+
+Install `ray` locally before submitting jobs:
+
+
+
+```shell
+$ pip install ray
+```
+
+
+
+The submit script below runs the Miles training job on the Ray cluster. The
+model is sharded across all 8 GPUs per node with tensor parallelism, and SGLang
+uses the same 8 GPUs per node for rollout.
+
+
+
+```bash
+#!/bin/bash
+set -euo pipefail
+
+export RAY_ADDRESS=http://localhost:8265
+
+: "${NUM_NODES:?NUM_NODES is not set}"
+: "${GPUS_PER_NODE:?GPUS_PER_NODE is not set}"
+
+MODEL_ID="Qwen/Qwen2.5-32B-Instruct"
+MODEL_NAME="$(basename "$MODEL_ID")"
+HF_CHECKPOINT="/root/$MODEL_NAME"
+REF_LOAD="/root/${MODEL_NAME}_torch_dist"
+PROMPT_DATA="/root/gsm8k/main/train-00000-of-00001.parquet"
+EVAL_PROMPT_DATA="/root/gsm8k/main/test-00000-of-00001.parquet"
+INPUT_KEY="question"
+LABEL_KEY="answer"
+EVAL_DATASET_NAME="gsm8k"
+CHECKPOINT_DIR="/checkpoints/${MODEL_NAME}-${EVAL_DATASET_NAME}"
+SAVE_INTERVAL=10
+WANDB_PROJECT="dstack-miles-RL"
+WANDB_GROUP="${MODEL_NAME}-gsm8k-${NUM_NODES}node-${GPUS_PER_NODE}gpu"
+WANDB_NAME="rollout-$(date +%Y%m%d-%H%M%S)"
+ROLLOUT_GPUS_PER_ENGINE=8
+
+CMD='cd /root/miles && python3 train.py \
+ --actor-num-nodes '"$NUM_NODES"' \
+ --actor-num-gpus-per-node '"$GPUS_PER_NODE"' \
+ --num-gpus-per-node '"$GPUS_PER_NODE"' \
+ --rollout-num-gpus-per-engine '"$ROLLOUT_GPUS_PER_ENGINE"' \
+ --sglang-server-concurrency 128 \
+ --colocate \
+ --calculate-per-token-loss \
+ --use-miles-router \
+ --swiglu \
+ --num-layers 64 \
+ --hidden-size 5120 \
+ --ffn-hidden-size 27648 \
+ --num-attention-heads 40 \
+ --use-rotary-position-embeddings \
+ --disable-bias-linear \
+ --add-qkv-bias \
+ --normalization RMSNorm \
+ --norm-epsilon 1e-5 \
+ --rotary-base 1000000 \
+ --group-query-attention \
+ --num-query-groups 8 \
+ --vocab-size 152064 \
+ --untie-embeddings-and-output-weights \
+ --hf-checkpoint '"$HF_CHECKPOINT"' \
+ --ref-load '"$REF_LOAD"' \
+ --prompt-data '"$PROMPT_DATA"' \
+ --input-key '"$INPUT_KEY"' \
+ --label-key '"$LABEL_KEY"' \
+ --apply-chat-template \
+ --rollout-shuffle \
+ --rm-type math \
+ --num-rollout 20 \
+ --rollout-batch-size 8 \
+ --n-samples-per-prompt 8 \
+ --rollout-max-response-len 512 \
+ --rollout-temperature 1 \
+ --global-batch-size 64 \
+ --eval-interval 5 \
+ --eval-prompt-data '"$EVAL_DATASET_NAME"' '"$EVAL_PROMPT_DATA"' \
+ --n-samples-per-eval-prompt 1 \
+ --eval-max-response-len 512 \
+ --eval-top-k 1 \
+ --tensor-model-parallel-size 8 \
+ --sequence-parallel \
+ --pipeline-model-parallel-size 1 \
+ --context-parallel-size 1 \
+ --expert-model-parallel-size 1 \
+ --expert-tensor-parallel-size 1 \
+ --use-dynamic-batch-size \
+ --max-tokens-per-gpu 9216 \
+ --advantage-estimator grpo \
+ --use-kl-loss \
+ --kl-loss-coef 0.00 \
+ --kl-loss-type low_var_kl \
+ --kl-coef 0.00 \
+ --entropy-coef 0.00 \
+ --eps-clip 0.2 \
+ --eps-clip-high 0.28 \
+ --optimizer adam \
+ --lr 1e-6 \
+ --lr-decay-style constant \
+ --weight-decay 0.1 \
+ --adam-beta1 0.9 \
+ --adam-beta2 0.98 \
+ --sglang-mem-fraction-static 0.7 \
+ --use-wandb \
+ --wandb-host https://wandb.ai/ \
+ --wandb-project '"$WANDB_PROJECT"' \
+ --wandb-group '"$WANDB_GROUP"' \
+ --wandb-exp-name '"$WANDB_NAME"' \
+ --attention-dropout 0.0 \
+ --hidden-dropout 0.0 \
+ --accumulate-allreduce-grads-in-fp32 \
+ --attention-softmax-in-fp32 \
+ --attention-backend flash \
+ --save '"$CHECKPOINT_DIR"' \
+ --save-interval '"$SAVE_INTERVAL"''
+
+# GLOO_SOCKET_IFNAME=eth0 is required for multi-node Gloo process group init.
+# Without it, Gloo resolves to a loopback address (127.0.1.1) instead of the
+# inter-node interface, causing `init_gloo_group()` to timeout.
+RUNTIME_ENV_JSON=$(cat <
+
+Submit the job with the same cluster shape as the task:
+
+
+
+```shell
+$ NUM_NODES=2 GPUS_PER_NODE=8 bash submit-miles-train.sh
+```
+
+
+
+!!! info "Training parameters"
+ 1. `--tensor-model-parallel-size 8` shards the 32B model across all 8 GPUs
+ per node.
+ 2. `--rollout-num-gpus-per-engine 8` starts SGLang with TP-8 on each node.
+ 3. `--sglang-server-concurrency` sets how many requests SGLang processes
+ concurrently.
+ 4. `--max-tokens-per-gpu 9216` sets the per-GPU token budget. Lower this if
+ Megatron OOMs during training.
+ 5. `--sglang-mem-fraction-static 0.7` sets the SGLang KV cache memory
+ fraction. Lower this if Megatron OOMs at startup.
+
+Using Ray via `dstack` gives you access to the Ray ecosystem while benefiting
+from `dstack`'s provisioning capabilities.
+
+!!! info "What's next"
+ 1. Read about [distributed tasks](../../concepts/tasks.md#distributed-tasks)
+ and [fleets](../../concepts/fleets.md)
+ 2. See the [SGLang inference](../inference/sglang.md) example
+ 3. Browse Miles' [examples](https://github.com/radixark/miles/tree/main/examples)