Project Page · Jun 2026

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Hao Li1,2* · Ganlong Zhao1,2*,† · Yufei Liu1,4* · Haotian Hou1,2* · Guoquan Ye1,3 · Tongyan Fang1,5 · Chunxiao Liu1 · Siyuan Huang1† · Jianbo Liu1 · Xiaogang Wang1,2 · Hongsheng Li2,1✉
1ACE Robotics · 2CUHK MMLab · 3CUHK, Shenzhen · 4SJTU · 5THU

*Equal contribution · †Project lead · ✉Corresponding author

ACE-Ego-0 transfers mixed-source VLA pretraining to real bimanual manipulation with camera-space action prediction.

Overview

ACE-Ego-0 unifies egocentric human videos, multi-embodiment robot demonstrations, and simulation rollouts for VLA pretraining.

[Figure: ACE-Ego-0 overview diagram]
ACE-Ego-0 aligns spatial, embodiment, source-quality, and temporal axes across human egocentric videos, robot demonstrations, and simulation data.
6.0K+ hrs mixed embodied pretraining data
72.8% RoboCasa GR1 TableTop success
91.12% / 90.62% RoboTwin 2.0 Easy / Hard success
78.3% real bimanual ARX success

Abstract

Vision-language-action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Large-scale egocentric human videos provide complementary real-world supervision, but joint training on human and robot data is difficult because their action spaces, embodiment structures, temporal dynamics, and supervision quality do not match.

We introduce ACE-Ego-0, a unified VLA pretraining framework that converts human egocentric videos into robot-format pseudo-action trajectories, aligns heterogeneous sources through camera-space actions, morphology conditioning, and time-aligned action chunking, and trains with reliability-aware human auxiliary supervision. ACE-Ego-0 pretrains on 4.53K hours of robot and simulation data plus 1.48K hours of pseudo-action-labeled egocentric human data, achieving state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0 with strong real-world bimanual transfer.

Method Overview

ACE-Ego-0 resolves spatial, structural, temporal, and label-quality mismatches between human egocentric video and robot trajectories.

[Figure: ACE-Ego-0 method architecture]
The pretrained VLM feeds an action expert that predicts continuous camera-space action chunks with morphology conditioning and flow matching.
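
As a rough illustration of flow matching over an action chunk, the sketch below builds one training pair under a linear noise-to-action path. This is one common convention and not necessarily the one used by ACE-Ego-0; the chunk shape (16 steps, 7-D actions), the function names, and the time convention (t = 0 is noise, t = 1 is the clean chunk) are all assumptions.

```python
import numpy as np

def flow_matching_pair(actions, noise, t):
    """Linear-path flow matching (assumed convention: t=0 noise, t=1 data).

    Returns the interpolated chunk x_t and the velocity target that an
    action expert would be trained to regress at (x_t, t).
    """
    x_t = (1.0 - t) * noise + t * actions
    target_velocity = actions - noise
    return x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

rng = np.random.default_rng(0)
chunk = rng.normal(size=(16, 7))      # hypothetical chunk: 16 steps, 7-D action
noise = rng.normal(size=chunk.shape)
t = 0.3
x_t, velocity = flow_matching_pair(chunk, noise, t)

# With the exact velocity, integrating from t to 1 recovers the clean chunk.
recovered = x_t + (1.0 - t) * velocity
```

At inference time, a model trained this way would start from pure noise and integrate its predicted velocity in a few steps to produce the action chunk.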

Camera-Space Actions

Robot end-effector trajectories and reconstructed human pseudo-actions are represented in the observation-centric camera frame.
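
A minimal sketch of what expressing an end-effector pose in the camera frame can look like, assuming poses are 4x4 homogeneous transforms and the camera extrinsics relative to the robot base are known. The function name and the example poses are illustrative and do not come from the ACE-Ego-0 code.

```python
import numpy as np

def pose_to_camera_frame(T_base_cam, T_base_ee):
    """Return the end-effector pose in the camera frame.

    T_base_cam: 4x4 pose of the camera in the base frame (assumed known).
    T_base_ee:  4x4 pose of the end effector in the base frame.
    """
    return np.linalg.inv(T_base_cam) @ T_base_ee

# Hypothetical camera: rotated 90 degrees about z, located at (1, 0, 0).
T_base_cam = np.eye(4)
T_base_cam[:3, :3] = np.array([[0.0, -1.0, 0.0],
                               [1.0,  0.0, 0.0],
                               [0.0,  0.0, 1.0]])
T_base_cam[:3, 3] = [1.0, 0.0, 0.0]

# Hypothetical end effector at (1, 1, 0.5) with identity orientation.
T_base_ee = np.eye(4)
T_base_ee[:3, 3] = [1.0, 1.0, 0.5]

T_cam_ee = pose_to_camera_frame(T_base_cam, T_base_ee)
ee_position_in_camera = T_cam_ee[:3, 3]
```

Human pseudo-actions reconstructed from egocentric video already live in a camera frame, so mapping robot trajectories into the same kind of frame gives both sources a shared coordinate convention.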

Morphology Conditioning

Robot URDF graph embeddings and learned human surrogate tokens condition the action expert without changing the VLM backbone.

Reliability-Aware Training

Sensor-logged robot actions supervise the primary loss, while noisy human pseudo-actions contribute through quality-weighted auxiliary losses.
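
One way to picture this split is a primary robot loss plus a down-weighted human term whose per-sample weights come from a quality score. The sketch below assumes that form; the quality scores in [0, 1], the `aux_weight` value, and the normalization by total weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mixed_source_loss(robot_errors, human_errors, human_quality, aux_weight=0.1):
    """Primary robot loss plus a quality-weighted auxiliary human loss.

    robot_errors:  per-sample losses on sensor-logged robot actions.
    human_errors:  per-sample losses on human pseudo-actions.
    human_quality: per-sample reliability scores in [0, 1] (assumed given).
    aux_weight:    hypothetical scale for the auxiliary term.
    """
    robot_loss = np.mean(robot_errors)
    weights = np.clip(human_quality, 0.0, 1.0)
    # Weighted mean, so unreliable samples contribute little or nothing.
    human_loss = np.sum(weights * human_errors) / max(np.sum(weights), 1e-8)
    return float(robot_loss + aux_weight * human_loss)

robot_errors = np.array([0.2, 0.4])
human_errors = np.array([1.0, 3.0])
human_quality = np.array([1.0, 0.0])   # second pseudo-action is fully distrusted

total = mixed_source_loss(robot_errors, human_errors, human_quality)
```

In this toy case the distrusted sample with error 3.0 has no effect on the total, while the robot term is unaffected by human label quality.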

Data Construction

ACE-Ego-0 converts raw human egocentric videos into robot-compatible pseudo-action trajectories through curation, reconstruction, action extraction, and filtering.

[Figure: ACE-Ego-0 human video data processing pipeline]
The five-stage processing pipeline curates human video datasets, selects valid egocentric clips, reconstructs 3D hands, extracts gripper-style actions, and filters noisy pseudo-actions.
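
The last two stages can be sketched as follows, under strong assumptions: the gripper opening is taken from the thumb-to-index fingertip distance of the reconstructed 3D hand, and a clip is rejected when reconstruction confidence is low or the wrist jumps implausibly between frames. The thresholds, function names, and the choice of these particular heuristics are hypothetical; the page does not specify them.

```python
import numpy as np

def gripper_opening(thumb_tip, index_tip, max_open=0.10):
    """Map thumb-index fingertip distance (meters) to an opening in [0, 1].

    max_open is a hypothetical distance treated as fully open.
    """
    dist = np.linalg.norm(thumb_tip - index_tip, axis=-1)
    return np.clip(dist / max_open, 0.0, 1.0)

def keep_clip(wrist_positions, confidences, min_conf=0.5, max_step=0.05):
    """Keep a clip only if it is confidently reconstructed and smooth.

    Expects at least two frames. Thresholds are illustrative assumptions.
    """
    steps = np.linalg.norm(np.diff(wrist_positions, axis=0), axis=-1)
    return bool(confidences.min() >= min_conf and steps.max() <= max_step)

thumb = np.zeros((3, 3))
index = np.array([[0.05, 0.0, 0.0], [0.10, 0.0, 0.0], [0.20, 0.0, 0.0]])
openings = gripper_opening(thumb, index)

smooth_wrist = np.array([[0.00, 0.0, 0.0], [0.01, 0.0, 0.0], [0.02, 0.0, 0.0]])
jumpy_wrist = np.array([[0.00, 0.0, 0.0], [0.50, 0.0, 0.0], [0.02, 0.0, 0.0]])
conf = np.array([0.9, 0.8, 0.9])

keep_smooth = keep_clip(smooth_wrist, conf)
keep_jumpy = keep_clip(jumpy_wrist, conf)
```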

Human Video Coverage

Task-matched egocentric human videos provide complementary action coverage for data-scarce robot fine-tuning tasks.

[Figure: Robot and human trajectory coverage]
Sweep Cubes trajectory coverage: 419 human-video episodes cover a workspace area 4.8x larger than that covered by 34 robot demonstrations.
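
The page does not state how workspace area is measured. One plausible way to compare coverage, shown here purely as an assumption, is to project trajectory points onto a plane, rasterize them into a grid, and count occupied cells. The cell size, function names, and toy data are hypothetical, and the resulting ratio is unrelated to the reported 4.8x.

```python
import numpy as np

def covered_area(points_xy, cell=0.05):
    """Approximate covered area as (number of occupied grid cells) * cell^2."""
    cells = np.unique(np.floor(points_xy / cell).astype(int), axis=0)
    return len(cells) * cell * cell

def grid_points(n, cell=0.05):
    """Toy data: one point at the center of each cell in an n x n patch."""
    centers = (np.arange(n) + 0.5) * cell
    xs, ys = np.meshgrid(centers, centers)
    return np.column_stack([xs.ravel(), ys.ravel()])

robot_xy = grid_points(2)   # toy robot trajectories covering a 2 x 2 patch
human_xy = grid_points(4)   # toy human trajectories covering a 4 x 4 patch

coverage_ratio = covered_area(human_xy) / covered_area(robot_xy)
```

A grid count is insensitive to how many times a cell is visited, so it measures spatial spread and not the number of episodes.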

Real-Robot Demonstrations

ACE-Ego-0 is deployed on a bimanual ARX platform with a head-mounted camera and camera-space delta end-effector commands.

Scoop Coffee
Category Sorting
Pick Tea / Arrange Drinks
Stack Bowls
Sweep Cubes
Pack Shoes
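
For deployment, a delta predicted along camera axes has to be turned into a motion the arm controller understands. The sketch below assumes the controller takes base-frame poses and that the head-camera extrinsics are known; the function name and the convention that rotation deltas are expressed about camera axes are assumptions, not details from the ACE-Ego-0 release.

```python
import numpy as np

def camera_delta_to_base(T_base_cam, T_base_ee, delta_pos_cam, delta_rot_cam):
    """Apply a camera-frame delta to an end-effector pose given in the base frame.

    delta_pos_cam: 3-vector translation along camera axes.
    delta_rot_cam: 3x3 rotation about camera axes (assumed convention).
    """
    R_bc = T_base_cam[:3, :3]
    T_new = T_base_ee.copy()
    # Rotate the translation delta from camera axes into base axes.
    T_new[:3, 3] = T_base_ee[:3, 3] + R_bc @ delta_pos_cam
    # Change the basis of the rotation delta, then apply it to the current pose.
    T_new[:3, :3] = R_bc @ delta_rot_cam @ R_bc.T @ T_base_ee[:3, :3]
    return T_new

# Hypothetical head camera rotated 90 degrees about the base z axis.
T_base_cam = np.eye(4)
T_base_cam[:3, :3] = np.array([[0.0, -1.0, 0.0],
                               [1.0,  0.0, 0.0],
                               [0.0,  0.0, 1.0]])

T_base_ee = np.eye(4)
T_base_ee[:3, 3] = [0.3, 0.0, 0.2]

T_target = camera_delta_to_base(
    T_base_cam, T_base_ee,
    delta_pos_cam=np.array([0.1, 0.0, 0.0]),
    delta_rot_cam=np.eye(3),
)
```

In this toy setup a 10 cm step along the camera x axis becomes a 10 cm step along the base y axis, and the orientation is left unchanged.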

Results

ACE-Ego-0 outperforms strong VLA baselines on simulation benchmarks and on real bimanual manipulation.

RoboCasa GR1 TableTop

Method        Average Success (%)
GR00T-N1.6    47.6
Qwen3PI       43.9
FLARE         55.0
ABot-M0       58.3
JoyAI-RA      63.2
DIAL          70.2
ACE-Ego-0     72.8

RoboTwin 2.0

Method        Easy (%)    Hard (%)
pi_0.5        82.74       76.76
Motus         88.66       87.02
LingBot-VLA   88.56       86.68
ABot-M0       86.06       85.08
JoyAI-RA      90.48       89.28
ACE-Ego-0     91.12       90.62

[Figure: Real-robot success rates]
Real-robot success rates for six ARX bimanual tasks and their average, with 30 trials per task.

Citation

@misc{li2026aceego0unifyingegocentrichuman,
      title={ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining}, 
      author={Hao Li and Ganlong Zhao and Yufei Liu and Haotian Hou and Guoquan Ye and Tongyan Fang and Chunxiao Liu and Siyuan Huang and Jianbo Liu and Xiaogang Wang and Hongsheng Li},
      year={2026},
      eprint={2606.17200},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.17200}, 
}