The first framework enabling humanoid loco-manipulation with egocentric human demonstrations.
Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. We present EGOHUMANOID, the first framework to co-train a vision-language-action policy using abundant egocentric human demonstrations together with a limited amount of robot data.
To bridge the embodiment gap, we introduce a systematic alignment pipeline with two key components: view alignment reduces visual discrepancies; action alignment maps human motions into a unified action space for humanoid control.
Extensive real-world experiments show that incorporating robot-free egocentric data improves success rates over robot-only baselines by 51%, with especially large gains in unseen environments.
- Overview
- Hardware Setup
- Environment Setup
- Data Collection
- Data Processing
- Model Training
- Deployment
- Requirements Summary
- Citation
Detailed Documentation:
- Robot Data Collection Guide
- Human Data Collection Guide
- Data Processing Pipeline
- View Alignment
```mermaid
graph LR
    A[Human Demo<br/>PICO VR + ZED] -->|Collect| B[Egocentric Data]
    B -->|Process| C[View Alignment]
    C -->|Train| D[VLA Model<br/>π0.5-based]
    D -->|Deploy| E[Unitree G1<br/>Robot Control]
```
EgoHumanoid consists of four main components:

- **Data Collection** — collect synchronized multi-modal data from both humanoid robots (Unitree G1) and human demonstrators (PICO VR + ZED Mini)
- **Data Processing** — process, align, and retarget human demonstrations to the robot action space
- **Model Training** — fine-tune vision-language-action models (π0.5) on the processed datasets
- **Deployment** — deploy trained policies on real humanoid robots with real-time inference
Setup Requirements:
- PICO and PC on the same network
- Full-body tracking activated
- ZED Mini connected via USB 3.0

> **Note:** We use the ZED Mini instead of the ZED X Mini (as in the paper) for easier accessibility and setup.
| Requirement | Details |
|---|---|
| OS | Ubuntu 22.04 (tested and recommended) |
| Python | 3.11+ |
| GPU | ≥ 8 GB VRAM (inference) • ≥ 22.5 GB VRAM (fine-tuning with LoRA) • ≥ 70 GB VRAM (full fine-tuning) |
- Clone the repository with submodules:

  ```shell
  git clone --recurse-submodules https://github.com/OpenDriveLab/EgoHumanoid.git
  cd EgoHumanoid
  ```

- Install the uv package manager:

  ```shell
  # See https://docs.astral.sh/uv/getting-started/installation/
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Set up the Python environment:

  ```shell
  # Create environment and install dependencies
  GIT_LFS_SKIP_SMUDGE=1 uv sync
  GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
  ```

- Set up GR00T WholeBodyControl for robot control (for robot data collection only):

  ```shell
  # Clone GR00T WholeBodyControl
  cd data_collection/robot_data
  git clone https://github.com/NVlabs/GR00T-WholeBodyControl.git
  cd GR00T-WholeBodyControl/decoupled_wbc

  # Copy teleoperation scripts
  cp -r ../../teleop/ control/main/teleop

  # Set up the Docker environment:
  # modify docker/run_docker.sh to mount the src/openpi directory;
  # see data_collection/robot_data/README.md for detailed setup instructions.

  # Install and start the Docker container
  ./docker/run_docker.sh --install --root
  ```

- Install the ZED SDK (for data collection/processing). Follow the ZED SDK installation guide for your platform, then install the Python API:

  ```shell
  python /usr/local/zed/get_python_api.py
  ```

- Install the PICO SDK (for human data collection). Follow the XR Robotics guidelines to set up the PICO SDK and XRoboToolkit-PC-Service.
For detailed hardware setup and collection procedures, see data_collection/robot_data/README.md and data_collection/human_data/README.md.
Robot data collection uses the Unitree G1 humanoid with teleoperation control and synchronized camera recording.
1. Set up the Docker environment:

   ```shell
   cd data_collection/robot_data/GR00T-WholeBodyControl/decoupled_wbc
   # Modify docker/run_docker.sh to mount the src/openpi directory;
   # see data_collection/robot_data/README.md for details.

   # Install and start the Docker container
   ./docker/run_docker.sh --install --root
   ./docker/run_docker.sh --root
   ```

2. Inside the Docker container, run G1 control:

   ```shell
   # For the real robot (ensure the network is configured)
   python decoupled_wbc/control/main/teleop/run_g1_control_loop.py --interface real --control-frequency 50 --with_hands
   ```

3. In a separate terminal, run teleoperation:

   ```shell
   python decoupled_wbc/control/main/teleop/run_teleop_policy_loop.py --body-control-device pico --hand_control_device=pico --enable_real_device
   ```

4. Start data collection:

   ```shell
   python decoupled_wbc/control/main/teleop/zed_mini_run_g1_data_exporter.py --dataset-name <task_name> --visualize
   ```

Controller Bindings:

- `Menu + Left Trigger`: Toggle lower-body policy
- `Menu + Right Trigger`: Toggle upper-body policy
- `Left Stick`: X/Y translation
- `Right Stick`: Yaw rotation
- `L/R Triggers`: Control hand grippers
- `A` Button: Start collecting an episode
- `B` Button: Discard the episode

Output:

- `data_collection/<task_name>/episode_*.hdf5` — robot state, actions, and navigation commands
- `data_collection/<task_name>/episode_*.svo2` — ZED camera recordings
See data_collection/robot_data/README.md for detailed instructions.
Human data collection captures synchronized full-body motion and binocular camera views.
1. Set up the data collection environment:

   ```shell
   # Create a conda environment
   conda create -n humandata python=3.11
   conda activate humandata

   # Install dependencies
   pip install -r data_collection/human_data/requirements.txt
   ```

2. Start data collection:

   ```shell
   cd data_collection/human_data

   # Basic collection
   python scripts/human_data_collection.py --name <dataset_name>

   # With ZED camera preview
   python scripts/human_data_collection.py --name <dataset_name> --visualize-zed

   # Specify the save directory
   python scripts/human_data_collection.py --data-dir <save_dir> --name <dataset_name>
   ```

3. Collection workflow:

   - The system initializes (PICO SDK + ZED Mini + MeshCat visualization)
   - Open a browser at `http://localhost:7000/static/` to view the 3D skeleton
   - Enter an episode index (e.g., 0, 1, 2...)
   - Perform the demonstration
   - Press Space to finish the episode
   - Data is saved automatically
   - Continue to the next episode or press Ctrl+C to exit

Output:

- `<data_dir>/<dataset_name>/episode_*.hdf5` — body pose, hand pose, controller pose, timestamps
- `<data_dir>/<dataset_name>/episode_*.svo2` — ZED Mini video with depth
See data_collection/human_data/README.md for detailed instructions.
For detailed pipeline documentation, see data_alignment/human_data_process/README.md.
The human data processing pipeline transforms raw VR recordings into robot-compatible datasets.
Run the full pipeline:

```shell
cd data_alignment/human_data_process
./run_human_data_pipeline.sh \
    --input_dir /path/to/raw_data \
    --output_dir /path/to/intermediate \
    --final-output-dir /path/to/final \
    --file all
```

Pipeline stages:
- Reorder Episodes: Sort chronologically and renumber
- Navigation Pipeline: Generate velocity commands from body pose
- Downsample: Reduce frequency and discretize commands
- Merge Camera: Integrate ZED camera frames
- Hand Status: Compute binary hand open/close status
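As intuition for the navigation stage, velocity commands can be obtained by finite-differencing the tracked body pose. The sketch below is illustrative only; the actual pipeline's frame conventions, smoothing, and downsampling/discretization may differ, and `pose_to_commands` is a hypothetical helper, not a function from this repo:

```python
import numpy as np

def pose_to_commands(xy, yaw, dt):
    """Finite-difference planar body poses into velocity commands.

    xy:  (T, 2) root positions in the world frame (metres)
    yaw: (T,)   root headings (radians)
    Returns a (T-1, 3) array of [vx, vy, yaw_rate] expressed in the body
    frame, the form a locomotion policy typically consumes.
    """
    d_xy = np.diff(xy, axis=0) / dt
    # Rotate world-frame velocity into the body frame at each step
    c, s = np.cos(yaw[:-1]), np.sin(yaw[:-1])
    vx = c * d_xy[:, 0] + s * d_xy[:, 1]
    vy = -s * d_xy[:, 0] + c * d_xy[:, 1]
    # Wrap heading differences into [-pi, pi) before differentiating
    d_yaw = (np.diff(yaw) + np.pi) % (2 * np.pi) - np.pi
    yaw_rate = d_yaw / dt
    return np.stack([vx, vy, yaw_rate], axis=-1)
```

The downsample stage would then bin these continuous commands into the discrete command set the controller accepts.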
Advanced usage:

```shell
# Skip stages
./run_human_data_pipeline.sh \
    --input_dir /path/to/raw \
    --output_dir /path/to/processed \
    --final-output-dir /path/to/final \
    --skip-reorder \
    --skip-merge

# Generate validation plots
./run_human_data_pipeline.sh \
    --input_dir /path/to/raw \
    --output_dir /path/to/processed \
    --final-output-dir /path/to/final \
    --with-png

# Dry run (preview commands)
./run_human_data_pipeline.sh \
    --input_dir /path/to/raw \
    --output_dir /path/to/processed \
    --final-output-dir /path/to/final \
    --dry-run
```

See data_alignment/human_data_process/README.md for detailed pipeline documentation.
Process robot demonstration data:

```shell
cd data_alignment/robot_data_process
python merge_data.py \
    --dataset-dir /path/to/robot/data \
    --output-dir /path/to/processed/output
```

View alignment transforms egocentric camera viewpoints to match the robot's perspective using depth-based warping and inpainting.
Process a single HDF5 file:

```shell
cd data_alignment/view_alignment
python viewport_transform_batch_h5.py \
    --h5_file /path/to/input.h5 \
    --image_key "observation_image_left" \
    --trajectory "down" \
    --movement_distance 0.07 \
    --output_dir ./output
```

Process a directory (multi-GPU):

```shell
python viewport_transform_batch_h5.py \
    --h5_dir /path/to/h5_directory \
    --batch_size 32 \
    --trajectory "down" \
    --movement_distance 0.07 \
    --num_gpus 4 \
    --output_dir /path/to/output
```

Trajectory options: `left`, `right`, `up`, `down`, `forward`, `backward`.
See data_alignment/view_alignment/README.md for more details.
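For intuition, the depth-based warping at the heart of view alignment can be sketched in a few lines of numpy. This is an illustrative forward-warp under a pure camera translation, not the repo's actual implementation (which also handles occlusion and fills holes with inpainting); `warp_view` is a hypothetical helper:

```python
import numpy as np

def warp_view(image, depth, K, t):
    """Forward-warp an image to a virtually translated camera.

    Each pixel is back-projected to 3D using its depth, the camera is
    shifted by translation t (metres, camera frame), and the point is
    re-projected with intrinsics K. Disocclusion holes stay zero here;
    a real pipeline would inpaint them.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project pixels to camera-frame 3D points
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts = np.stack([x, y, depth], axis=-1) - t  # shifting the camera by +t shifts points by -t
    # Re-project into the shifted camera
    z = pts[..., 2]
    u2 = np.round(pts[..., 0] * K[0, 0] / z + K[0, 2]).astype(int)
    v2 = np.round(pts[..., 1] * K[1, 1] / z + K[1, 2]).astype(int)
    warped = np.zeros_like(image)
    valid = (z > 0) & (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    warped[v2[valid], u2[valid]] = image[valid]
    return warped
```

A `--trajectory "down"` with `--movement_distance 0.07` corresponds conceptually to calling such a warp with a 7 cm translation along the camera's down axis.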
Convert processed HDF5 datasets to LeRobot format for training:

```shell
cd data_alignment

# Single-threaded
python convert_to_lerobot.py \
    --src-path /path/to/processed/data \
    --output-path /path/to/lerobot/data \
    --repo-id my_dataset \
    --fps 20 \
    --task "task description"

# Multi-threaded (faster)
python convert_to_lerobot.py \
    --src-path /path/to/processed/data \
    --output-path /path/to/lerobot/data \
    --repo-id my_dataset \
    --num-workers 16 \
    --fps 20 \
    --task "task description"
```

Before training, compute normalization statistics for your dataset:

```shell
uv run python scripts/compute_norm_states_ultra_fast.py --config-name=norm_compute
```

Train the model using the computed normalization statistics:
```shell
# Set XLA memory fraction for better GPU utilization
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py <config_name> --exp_name=<experiment_name>
```

Examples:

```shell
# Train on your custom dataset
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_g1_custom --exp_name=my_experiment

# Multi-GPU training with FSDP
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_g1_custom --exp_name=my_experiment --fsdp-devices 4
```

Checkpoints are saved to `checkpoints/<config_name>/<exp_name>/` during training. Training progress is logged to the console and to Weights & Biases.
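Conceptually, the normalization statistics computed before training are just per-dimension means and standard deviations of states and actions over the whole dataset. A minimal numpy sketch (the real script streams LeRobot episodes in parallel; `compute_norm_stats` here is illustrative, not the repo's API):

```python
import numpy as np

def compute_norm_stats(episodes):
    """Per-dimension mean/std over a list of (T_i, D) arrays.

    Accumulates running sums instead of concatenating episodes, so
    memory stays constant regardless of dataset size.
    """
    n, s, ss = 0, 0.0, 0.0
    for ep in episodes:
        n += ep.shape[0]
        s = s + ep.sum(axis=0)
        ss = ss + (ep ** 2).sum(axis=0)
    mean = s / n
    # Clamp tiny negative values caused by floating-point error
    std = np.sqrt(np.maximum(ss / n - mean ** 2, 0.0))
    return {"mean": mean, "std": std}
```

Training then normalizes each state/action dimension as `(x - mean) / std` before feeding it to the model.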
Start a policy server for remote inference:

```shell
# Use a trained checkpoint
uv run scripts/serve_policy.py policy:checkpoint \
    --policy.config=<config_name> \
    --policy.dir=checkpoints/<config_name>/<exp_name>/<iteration>
```

The server listens on port 8000 by default.
The deployment client connects to the OpenPI policy server via websocket for action inference and controls the G1 robot via the GR00T WBC framework.
On the robot/client side:

```shell
# Inside the GR00T Docker container
cd /root/Projects/openpi

# Run the deployment client
python scripts/deploy.py --host <server_ip> --port 8000
```

Keyboard Controls:

| Key | Action |
|---|---|
| `]` | |
| `p` | Enter preparation phase (move to initial pose) |
| `c` | Toggle left hand open/close (right hand stays open) |
| `l` | Start/pause inference loop |
| `[` | Enter silent mode (slowly return to initial pose) |
| `o` | Deactivate policy (emergency stop) |
| `Ctrl+C` | Exit program |
Workflow:

```mermaid
graph LR
    A[Start Policy Server<br/>on GPU Host] --> B[Start G1 Robot<br/>Control Loop]
    B --> C[Run Deployment<br/>Client]
    C --> D[Use Keyboard<br/>Controls]
    D --> E[Robot Execution]
```
Example Python API:

```python
from openpi.training import config as _config
from openpi.policies import policy_config

# Load the policy
config = _config.get_config("pi05_g1_custom")
checkpoint_dir = "checkpoints/pi05_g1_custom/exp1/100000"
policy = policy_config.create_trained_policy(config, checkpoint_dir)

# Run inference
observation = {
    "observation/exterior_image_1_left": camera_left_image,
    "observation/wrist_image_left": wrist_image,
    "observation/state": joint_positions,
    "prompt": "pick up the object",
}
action_chunk = policy.infer(observation)["actions"]

# Execute the first action of the chunk on the robot
robot.execute_action(action_chunk[0])
```

For detailed deployment instructions including camera setup, robot initialization, and troubleshooting, see the comments in scripts/deploy.py.
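The call to `robot.execute_action(action_chunk[0])` executes only the first action of each predicted chunk; deployment loops typically re-plan on a receding horizon instead. A hedged sketch of that pattern (the `infer`/`execute`/`get_obs` callables are illustrative stand-ins, not this repo's actual interfaces):

```python
def receding_horizon_loop(infer, execute, get_obs, n_steps, replan_every=8):
    """Closed-loop control that re-queries the policy every `replan_every` steps.

    infer(obs)      -> a chunk (sequence) of actions
    execute(action) -> applies one action on the robot
    get_obs()       -> returns the current observation
    Actions within a chunk run open-loop; re-planning restores feedback.
    """
    chunk, idx = [], 0
    for _ in range(n_steps):
        if idx >= min(replan_every, len(chunk)):
            # Re-plan from a fresh observation
            chunk, idx = infer(get_obs()), 0
        execute(chunk[idx])
        idx += 1
```

Choosing `replan_every` trades smoothness (longer open-loop stretches) against reactivity (more frequent re-planning).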
| Component | GPU Memory | Example Hardware |
|---|---|---|
| Inference | ≥ 8 GB VRAM | RTX 4090 |
| Fine-tuning (LoRA) | ≥ 22.5 GB VRAM | RTX 4090 |
| Fine-tuning (Full) | ≥ 70 GB VRAM | A100 80GB / H100 |
| Robot Control | N/A | Ubuntu 22.04 PC |
| Human Data Collection | N/A | Ubuntu 22.04 + USB 3.0 |
If you find EgoHumanoid useful in your research, please consider citing:

```bibtex
@article{shi2026egohumanoid,
  title={EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration},
  author={Shi, Modi and Peng, Shijia and Chen, Jin and Jiang, Haoran and Li, Yinghui and Huang, Di and Luo, Ping and Li, Hongyang and Chen, Li},
  journal={arXiv preprint arXiv:2602.10106},
  year={2026}
}
```

⭐ If you find this project helpful, please consider giving it a star! ⭐
This project is licensed under the Apache 2.0 License.
The OpenPI models and code are provided by Physical Intelligence under the Apache 2.0 License.
We sincerely thank the following projects and teams:
- OpenPI — vision-language-action models
- GR00T WholeBodyControl — humanoid control framework
- XRoboToolkit — PICO VR integration
- ZED SDK — ZED camera SDK