This guide covers how to use AlgEngine for training end-to-end autonomous driving models, running evaluations, extracting rare cases, and fine-tuning. AlgEngine is built on MMDetection3D and supports UniAD, VADv2, and HydraMDP architectures.
- Quick Reference
- Training
- Evaluation
- Rare Case Extraction
- Fine-Tuning
- Configuration
- Model Architectures
- Troubleshooting
cd projects/AlgEngine
# Training (8 GPUs)
./scripts/e2e_dist_train.sh <config> <num_gpus> [resume_checkpoint]
# Open-loop evaluation
./scripts/e2e_dist_eval.sh <config> <checkpoint> <num_gpus>
# Rare case extraction
python scripts/rare_case_sampling_by_pdms.py \
--pdm-result <csv_file> \
--base-split <yaml_file> \
--output-dir <output_dir>
# Closed-loop evaluation
bash scripts/run_ray_distributed_testing.sh <config> <checkpoint> <model_name> <data_type> <react_type>

Before training, ensure:
- ✅ AlgEngine environment installed (`algengine` conda env)
- ✅ Data prepared (see Data Organization)
- ✅ Pre-trained backbone weights downloaded
Train a model on 50% of the training data:
conda activate algengine
cd projects/AlgEngine
# Train VADv2 with 50% data (8 GPUs)
./scripts/e2e_dist_train.sh configs/worldengine/e2e_vadv2_50pct.py 8

Arguments:
- `<config>`: Configuration file path
- `<num_gpus>`: Number of GPUs to use
- `[resume_checkpoint]` (optional): Checkpoint to resume from
Train on the full dataset:

./scripts/e2e_dist_train.sh configs/worldengine/e2e_vadv2_100pct.py 8

Resume from a checkpoint:
./scripts/e2e_dist_train.sh \
configs/worldengine/e2e_vadv2_50pct.py \
8 \
work_dirs/e2e_vadv2_50pct/latest.pth

Auto-resume: If latest.pth exists in work_dirs/, training will auto-resume.
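The auto-resume behavior described above can be sketched as follows (a simplified illustration; `find_resume_checkpoint` is a hypothetical helper, not part of AlgEngine):

```python
import os

def find_resume_checkpoint(work_dir):
    """Return the checkpoint to resume from, or None to start fresh.

    Mirrors the documented behavior: if latest.pth exists in the work
    directory, training resumes from it automatically.
    """
    latest = os.path.join(work_dir, "latest.pth")
    return latest if os.path.exists(latest) else None

# Example: returns None when no latest.pth is present yet
print(find_resume_checkpoint("work_dirs/e2e_vadv2_50pct"))
```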
# Watch training log
tail -f work_dirs/e2e_vadv2_50pct/logs/train.*
# TensorBoard (if enabled)
tensorboard --logdir work_dirs/e2e_vadv2_50pct/tf_logs

Key metrics to monitor:
- `loss`: Total training loss (should decrease)
- `loss_planning`: Planning loss
- `loss_track`: Tracking loss
- `ade_4s`: Average displacement error at 4 seconds
- `fde_4s`: Final displacement error at 4 seconds
work_dirs/e2e_vadv2_50pct/
├── e2e_vadv2_50pct.py # Config backup
├── logs/
│ └── train.26040614* # Training logs
├── epoch_1.pth # Checkpoints
├── epoch_2.pth
...
├── epoch_20.pth
└── latest.pth # Symlink to latest checkpoint
Evaluate model predictions against ground truth trajectories.
conda activate algengine
cd projects/AlgEngine
# Evaluate on navtest (8 GPUs)
./scripts/e2e_dist_eval.sh \
configs/worldengine/e2e_vadv2_50pct.py \
work_dirs/e2e_vadv2_50pct/epoch_20.pth \
8

Output:
work_dirs/e2e_vadv2_50pct/
└── navtest.csv # Evaluation results
Evaluate on known rare navtest cases:
./scripts/e2e_dist_eval_navtest_failures.sh \
configs/worldengine/e2e_vadv2_50pct.py \
work_dirs/e2e_vadv2_50pct/epoch_20.pth \
8

Output:
work_dirs/e2e_vadv2_50pct/
└── navtest_failures.csv # rare navtest cases only
Open-loop metrics CSV format:
token,ade_4s,fde_4s,no_at_fault_collisions,drivable_area_compliance,ego_progress,comfort,score
abc123,0.42,0.85,1.0,0.95,0.88,0.92,0.89
...

Key metrics:
- `ade_4s`: Average trajectory error over 4 seconds (meters, lower is better)
- `fde_4s`: Final position error at 4 seconds (meters, lower is better)
- `no_at_fault_collisions`: Collision avoidance rate (0-1, higher is better)
- `drivable_area_compliance`: Stays in drivable area (0-1, higher is better)
- `ego_progress`: Route completion (0-1, higher is better)
- `comfort`: Comfort metric (0-1, higher is better)
- `score`: Overall PDM score (0-1, higher is better)
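For intuition, `ade_4s` and `fde_4s` reduce to simple distances between predicted and ground-truth waypoints. A minimal sketch in pure Python (hypothetical waypoint lists; the real evaluator's sampling rate and coordinate frames may differ):

```python
import math

def ade_fde(pred, gt):
    """Average and final displacement error between two waypoint lists.

    pred, gt: lists of (x, y) positions at matching timestamps.
    ADE averages the per-waypoint Euclidean error; FDE is the error
    at the final waypoint.
    """
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors), errors[-1]

pred = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3), (3.0, 0.6)]
gt   = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
ade, fde = ade_fde(pred, gt)
print(f"ade={ade:.3f} m, fde={fde:.3f} m")  # ade=0.250 m, fde=0.600 m
```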
Evaluate model in simulation (requires SimEngine).
See SimEngine Usage Guide for:
- Single-GPU testing
- Multi-GPU distributed testing
- Reactive vs non-reactive modes
Quick example:
cd projects/AlgEngine
bash scripts/run_ray_distributed_testing.sh \
$WORLDENGINE_ROOT/projects/AlgEngine/configs/worldengine/e2e_vadv2_50pct.py \
$WORLDENGINE_ROOT/projects/AlgEngine/work_dirs/e2e_vadv2_50pct/epoch_20.pth \
e2e_vadv2_50pct_epoch20 \
navtest_failures \
NR

Extract failure scenarios from evaluation results for targeted fine-tuning.
conda activate algengine
cd projects/AlgEngine
python scripts/rare_case_sampling_by_pdms.py \
--pdm-result work_dirs/e2e_vadv2_50pct/navtest.csv \
--base-split configs/navsim_splits/navtest_split/navtest.yaml \
--output-dir configs/navsim_splits/navtest_split/e2e_vadv2_50pct_rare

Arguments:
- `--pdm-result`: CSV file with evaluation metrics
- `--base-split`: Base scenario split YAML file
- `--output-dir`: Directory to save extracted split files
The script generates three rare case split files:
configs/navsim_splits/navtest_split/e2e_vadv2_50pct_rare/
├── navtest_collision.yaml # Collision scenarios
├── navtest_off_road.yaml # Off-road scenarios
└── navtest_ep_1pct.yaml # Low ego-progress (bottom 1%)
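The selection logic behind these splits can be sketched roughly as follows. This is a simplified pure-Python illustration; the thresholds and the off-road criterion (`drivable_area_compliance < 1.0`) are assumptions based on the customization notes below, and the actual script works on the PDM CSV with pandas and writes YAML splits.

```python
def split_rare_cases(rows, collision_thresh=1.0, ep_quantile=0.01):
    """Partition evaluation rows into rare-case buckets.

    rows: list of dicts with keys 'token', 'no_at_fault_collisions',
    'drivable_area_compliance', 'ego_progress' (one row per scenario).
    """
    collision = [r["token"] for r in rows
                 if r["no_at_fault_collisions"] < collision_thresh]
    off_road = [r["token"] for r in rows
                if r["drivable_area_compliance"] < 1.0]
    # Bottom ep_quantile fraction by ego progress (at least one scenario)
    by_ep = sorted(rows, key=lambda r: r["ego_progress"])
    k = max(1, int(len(rows) * ep_quantile))
    low_ep = [r["token"] for r in by_ep[:k]]
    return collision, off_road, low_ep

rows = [
    {"token": "a", "no_at_fault_collisions": 1.0,
     "drivable_area_compliance": 1.0, "ego_progress": 0.90},
    {"token": "b", "no_at_fault_collisions": 0.0,
     "drivable_area_compliance": 1.0, "ego_progress": 0.80},
    {"token": "c", "no_at_fault_collisions": 1.0,
     "drivable_area_compliance": 0.7, "ego_progress": 0.10},
]
print(split_rare_cases(rows))  # (['b'], ['c'], ['c'])
```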
Edit the script to customize:
# In rare_case_sampling_by_pdms.py
# Change collision threshold
collision_scenarios = df[df['no_at_fault_collisions'] < 0.95] # From 1.0
# Change ego-progress percentile
ep_threshold = df['ego_progress'].quantile(0.05)  # From 0.01 (1% -> 5%)

# Check how many scenarios were extracted
wc -l configs/navsim_splits/navtest_split/e2e_vadv2_50pct_rare/*.yaml
# View first few scenarios
head -20 configs/navsim_splits/navtest_split/e2e_vadv2_50pct_rare/navtest_collision.yaml

Fine-tune a trained model on rare cases using reinforcement learning.
Important: Rollout data must be generated by SimEngine before fine-tuning. This involves:

1. Run SimEngine rollout to generate trajectory data:

   # Multi-GPU distributed rollout (recommended for large datasets)
   export WORLDENGINE_ROOT=/path/to/WorldEngine
   cd projects/SimEngine
   bash scripts/run_ray_distributed_rollout.sh \
       $WORLDENGINE_ROOT/projects/AlgEngine/configs/worldengine/e2e_vadv2_50pct.py \
       $WORLDENGINE_ROOT/data/alg_engine/ckpts/e2e_vadv2_50pct_ep8.pth \
       e2e_vadv2_50pct \
       navtrain_50pct_collision \
       navtrain

2. Rollout output is saved to:

   experiments/closed_loop_exps/e2e_vadv2_50pct/navtrain_NR/
   └── WE_output/
       └── openscene_format/
           ├── sensor_blobs/    # Camera images, LiDAR
           ├── meta_datas/      # Per-scenario metadata
           ├── pdms_pkl/        # Metric pdms pkl
           └── all_scenes_pdm_averages_NR.csv

3. Reorganize to AlgEngine format (creates the openscene-synthetic dataset):

   conda activate simengine
   cd projects/SimEngine
   python scripts/export_simulation_data.py \
       --test_path experiments/closed_loop_exps/e2e_vadv2_50pct/navtrain_NR \
       --appendix 260406  # Date suffix for versioning; default None

   Output location: data/alg_engine/openscene-synthetic/

4. Verify the data structure:

   data/alg_engine/openscene-synthetic/
   ├── sensor_blobs/   # Replayed scenario sensor data
   ├── meta_datas/     # Metadata
   └── pdms_pkl/       # Metric pdms pkl
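A quick programmatic check of the reorganized dataset can be sketched like this (`verify_openscene_layout` is a hypothetical helper, not part of AlgEngine or SimEngine):

```python
import os

EXPECTED_SUBDIRS = ("sensor_blobs", "meta_datas", "pdms_pkl")

def verify_openscene_layout(root):
    """Return the expected subdirectories missing under root."""
    return [d for d in EXPECTED_SUBDIRS
            if not os.path.isdir(os.path.join(root, d))]

missing = verify_openscene_layout("data/alg_engine/openscene-synthetic")
if missing:
    print("missing:", missing)
else:
    print("layout looks good")
```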
For detailed SimEngine usage, see SimEngine Usage Guide.
conda activate algengine
cd projects/AlgEngine
# Fine-tune on extracted rare cases (8 GPUs)
./scripts/e2e_dist_train.sh \
configs/worldengine/e2e_vadv2_50pct_rlft_rare_log.py \
8 \
work_dirs/e2e_vadv2_50pct/epoch_20.pth

Arguments:
- Config with `_rlft_rare_log` suffix (uses rare case splits)
- Number of GPUs
- Base checkpoint to fine-tune from
The _rlft_rare_log config typically includes:
# configs/worldengine/e2e_vadv2_50pct_rlft_rare_log.py
# Use rare case splits
data = dict(
train=dict(
ann_file='merged_infos_navformer/nuplan_openscene_navtrain.pkl',
scenario_filter=[
'configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_ep8/navtrain_50pct_collision.yaml',
'configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_ep8/navtrain_50pct_off_road.yaml',
'configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_ep8/navtrain_50pct_ep_1pct.yaml',
]
)
)
# RL training settings
optimizer = dict(type='AdamW', lr=5e-5) # Lower learning rate
total_epochs = 8  # Fewer epochs for fine-tuning

Output:

work_dirs/e2e_vadv2_50pct_rlft_rare_log/
├── e2e_vadv2_50pct_rlft_rare_log.py
├── logs/
│ └── train.*
├── epoch_1.pth
...
└── epoch_8.pth
cd projects/AlgEngine
# Open-loop evaluation
./scripts/e2e_dist_eval.sh \
configs/worldengine/e2e_vadv2_50pct_rlft_rare_log.py \
work_dirs/e2e_vadv2_50pct_rlft_rare_log/epoch_8.pth \
8
# Closed-loop evaluation
bash scripts/run_ray_distributed_testing.sh \
$WORLDENGINE_ROOT/projects/AlgEngine/configs/worldengine/e2e_vadv2_50pct_rlft_rare_log.py \
$WORLDENGINE_ROOT/projects/AlgEngine/work_dirs/e2e_vadv2_50pct_rlft_rare_log/epoch_8.pth \
e2e_vadv2_50pct_rlft \
navtest_failures \
NR

AlgEngine uses hierarchical configuration with MMDetection3D. For a detailed reference of all config parameters, variants, and their relationships, see the Configuration Guide.
configs/
├── _base_/
│ └── default_runtime.py # Base runtime settings
├── worldengine/
│ ├── e2e_vadv2_50pct.py # 50% data training
│ ├── e2e_vadv2_100pct.py # 100% data training
│ ├── e2e_vadv2_50pct_rlft_rare_log.py # Rare case fine-tuning
│ └── ...
└── navsim_splits/
├── navtrain_split/
│ ├── navtrain.yaml # Full training set
│ ├── navtrain_50pct.yaml # 50% subset
│ └── e2e_vadv2_50pct_ep8/ # Rare case splits
│ ├── navtrain_50pct_collision.yaml
│ ├── navtrain_50pct_off_road.yaml
│ └── navtrain_50pct_ep_1pct.yaml
└── navtest_split/
├── navtest.yaml # Full test set
└── navtest_failures.yaml # Failure subset
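Configs build on each other: files under `worldengine/` inherit from `_base_/`, and `--cfg-options` applies dotted-key overrides on top. A simplified illustration of the merge semantics (this is a sketch, not the actual MMDetection3D/MMCV implementation):

```python
import copy

def merge_config(base, override):
    """Recursively merge an override dict into a copy of base."""
    out = copy.deepcopy(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge_config(out[key], val)
        else:
            out[key] = val
    return out

def apply_cfg_options(cfg, options):
    """Apply dotted-key overrides, e.g. {'optimizer.lr': 1e-4}."""
    cfg = copy.deepcopy(cfg)
    for dotted, val in options.items():
        node = cfg
        *parents, leaf = dotted.split(".")
        for p in parents:
            node = node.setdefault(p, {})
        node[leaf] = val
    return cfg

base = {"optimizer": {"type": "AdamW", "lr": 2e-4}, "total_epochs": 20}
cfg = merge_config(base, {"total_epochs": 30})       # child config override
cfg = apply_cfg_options(cfg, {"optimizer.lr": 1e-4})  # runtime --cfg-options
print(cfg)
```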
# Model architecture
model = dict(
type='VADv2', # or 'UniAD', 'HydraMDP'
num_query=900,
num_classes=7,
planning_steps=8,
img_backbone=dict(type='ResNet50', ...),
img_neck=dict(type='FPN', ...),
)
# BEV configuration
bev_h_, bev_w_ = 200, 200
patch_size = [102.4, 102.4] # Physical range (meters)
# Input modality
input_modality = dict(
use_lidar=False,
use_camera=True, # 8 cameras
use_radar=False,
use_external=True # CAN bus
)
# Training
total_epochs = 20
optimizer = dict(type='AdamW', lr=2e-4, weight_decay=0.01)
lr_config = dict(
policy='CosineAnnealing',
warmup='linear',
warmup_iters=500,
warmup_ratio=1.0 / 3,
)
# Data
data = dict(
samples_per_gpu=1,
workers_per_gpu=4,
train=dict(
ann_file='merged_infos_navformer/nuplan_openscene_navtrain.pkl',
scenario_filter='configs/navsim_splits/navtrain_split/navtrain_50pct.yaml',
),
val=dict(
ann_file='merged_infos_navformer/nuplan_openscene_navtest.pkl',
scenario_filter='configs/navsim_splits/navtest_split/navtest.yaml',
),
)

Override config parameters at runtime:
./scripts/e2e_dist_train.sh \
configs/worldengine/e2e_vadv2_50pct.py \
8 \
--cfg-options \
optimizer.lr=1e-4 \
total_epochs=30 \
data.samples_per_gpu=2

AlgEngine supports multiple end-to-end autonomous driving architectures.
Features:
- Vector-based scene representation
- Planning-oriented perception
- Efficient trajectory prediction
Config: configs/worldengine/e2e_vadv2_*.py
Best for: General driving scenarios, fast inference
Features:
- Unified perception-prediction-planning
- Multi-task learning
- Strong generalization
Config: configs/worldengine/e2e_uniad_*.py
Best for: Complex scenarios, research
Features:
- Multi-modal trajectory prediction
- Distribution-aware planning
- Behavior world model integration
Config: configs/worldengine/e2e_hydramdp_*.py
Best for: Safety-critical scenarios, rare cases
# Train UniAD instead of VADv2
./scripts/e2e_dist_train.sh configs/worldengine/e2e_uniad_50pct.py 8
# Evaluate HydraMDP
./scripts/e2e_dist_eval.sh \
configs/worldengine/e2e_hydramdp_50pct.py \
work_dirs/e2e_hydramdp_50pct/epoch_20.pth \
8

For very large models or datasets:
# Node 0 (master)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=28567
export WORLD_SIZE=16 # Total GPUs
export RANK=0 # Node rank
./scripts/e2e_dist_train.sh configs/worldengine/e2e_vadv2_100pct.py 8
# Node 1 (worker)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=28567
export WORLD_SIZE=16
export RANK=8
./scripts/e2e_dist_train.sh configs/worldengine/e2e_vadv2_100pct.py 8

Enable automatic mixed precision (AMP) for faster training:
# In config
fp16 = dict(loss_scale='dynamic')

For large batch sizes with limited GPU memory:
# In config
data = dict(
samples_per_gpu=1,
workers_per_gpu=4,
)
# Set gradient accumulation steps
runner = dict(
max_epochs=20,
gradient_accumulation_steps=4, # Effective batch size = 1 * 8 GPUs * 4 = 32
)

Out-of-memory (CUDA OOM) errors? Solution:
# Reduce batch size
# Edit config: data.samples_per_gpu = 1 (from 2)
# Reduce BEV resolution
# Edit config: bev_h_, bev_w_ = 150, 150 (from 200, 200)
# Use gradient checkpointing
# Edit config: model.img_backbone.with_cp = True

Training loss not decreasing? Possible causes:
- Learning rate too high/low
- Data loading issues
- Incorrect pre-trained weights
Solution:
# Check data loading
python tools/analysis_tools/browse_dataset.py configs/worldengine/e2e_vadv2_50pct.py
# Verify pre-trained weights loaded
grep "load checkpoint" work_dirs/*/logs/train.*
# Try different learning rate
./scripts/e2e_dist_train.sh ... --cfg-options optimizer.lr=1e-4

Evaluation hangs or stalls? Solution:
# Check if processes are stuck
ps aux | grep python
# Kill stuck processes
pkill -f "test.py"
# Restart evaluation with fewer GPUs
./scripts/e2e_dist_eval.sh ... 4  # Use 4 instead of 8

Import errors (mmcv/mmdet3d)? Solution:
# Ensure you're in the right environment
conda activate algengine
# Verify MMCV installation
python -c "import mmcv; print(mmcv.__version__)"
# Reinstall MMDetection3D if needed
pip uninstall mmdet3d -y
pip install mmdet3d==1.0.0rc6

Corrupted or incompatible checkpoint? Solution:
# Use a previous checkpoint
./scripts/e2e_dist_train.sh ... work_dirs/*/epoch_18.pth # Instead of epoch_20
# Or train from scratch
rm work_dirs/e2e_vadv2_50pct/latest.pth
./scripts/e2e_dist_train.sh ...

To speed up training and data loading:

- Increase workers: `data.workers_per_gpu = 8` (if CPU/RAM allows)
- Use SSD: Store data on fast NVMe SSD
- Mixed precision: Enable `fp16 = dict(loss_scale='dynamic')`
- Persistent workers: `data.persistent_workers = True`

To reduce GPU memory usage:

- Reduce batch size: `data.samples_per_gpu = 1`
- Lower BEV resolution: `bev_h_, bev_w_ = 150, 150`
- Gradient checkpointing: `model.img_backbone.with_cp = True`
- Clear cache: `torch.cuda.empty_cache()` in code
For multi-node distributed training:

- Even GPU allocation: Use same GPU type across nodes
- InfiniBand: Use high-speed interconnect for multi-node
- Shared filesystem: Use NFS/Lustre for data loading
- Monitor network: Watch for communication bottlenecks
- Run simulations: See SimEngine Usage Guide
- Understand evaluation: See Quick Start Guide
For questions, visit GitHub Discussions.