Project Page · Jun 2026

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Hao Li^1,2,*, Ganlong Zhao^1,2,*,†, Yufei Liu^1,4,*, Haotian Hou^1,2,*, Guoquan Ye^1,3, Tongyan Fang^1,5, Chunxiao Liu^1, Siyuan Huang^1,†, Jianbo Liu^1, Xiaogang Wang^1,2, Hongsheng Li^2,1,✉

^1 ACE Robotics · ^2 CUHK MMLab · ^3 CUHK, Shenzhen · ^4 SJTU · ^5 THU

* Equal contribution · † Project lead · ✉ Corresponding author

Technical Report · Code (coming soon) · Data (coming soon) · BibTeX

ACE-Ego-0 transfers mixed-source VLA pretraining to real bimanual manipulation with camera-space action prediction.

Overview

ACE-Ego-0 unifies egocentric human videos, multi-embodiment robot demonstrations, and simulation rollouts for VLA pretraining.

6.0K+ hrs mixed embodied pretraining data

72.8% RoboCasa GR1 TableTop success

91.12% / 90.62% RoboTwin 2.0 Easy / Hard success

78.3% real bimanual ARX success

Abstract

Vision-language-action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Large-scale egocentric human videos provide complementary real-world supervision, but joint training on human and robot data is difficult because their action spaces, embodiment structures, temporal dynamics, and supervision quality do not match.

We introduce ACE-Ego-0, a unified VLA pretraining framework that converts human egocentric videos into robot-format pseudo-action trajectories, aligns heterogeneous sources through camera-space actions, morphology conditioning, and time-aligned action chunking, and trains with reliability-aware human auxiliary supervision. ACE-Ego-0 pretrains on 4.53K hours of robot and simulation data plus 1.48K hours of pseudo-action-labeled egocentric human data, achieving state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0 with strong real-world bimanual transfer.

Method Overview

ACE-Ego-0 resolves spatial, structural, temporal, and label-quality mismatches between human egocentric video and robot trajectories.
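The temporal mismatch can be illustrated with the time-aligned action chunking named in the abstract. The sketch below is only one plausible reading, not the authors' implementation: it resamples a trajectory onto a fixed time step so that a chunk spans the same duration whatever the source frame rate. The function name, `horizon`, and `dt` are assumptions, and linear interpolation is only suitable for position-like dimensions.

```python
import numpy as np

def time_aligned_chunk(timestamps, actions, start, horizon=16, dt=0.05):
    """Sample `horizon` actions at a fixed step `dt` (seconds) from time `start`.

    timestamps: increasing array of shape (n,), in seconds.
    actions: array of shape (n, action_dim) recorded at those timestamps.
    `horizon` and `dt` are hypothetical defaults, not values from the report.
    """
    query = start + dt * np.arange(horizon)
    # Interpolate each action dimension independently; np.interp holds the
    # first/last recorded value for query times outside the recorded range.
    columns = [np.interp(query, timestamps, actions[:, d])
               for d in range(actions.shape[1])]
    return np.stack(columns, axis=-1)
```

Under this sketch, a 30 Hz robot log and a 10 Hz human clip of the same motion yield chunks of identical shape and duration.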

Figure: ACE-Ego-0 method architecture. The pretrained VLM feeds an action expert that predicts continuous camera-space action chunks with morphology conditioning and flow matching.
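For readers unfamiliar with flow matching, a minimal sketch of a generic flow-matching objective on action chunks follows. It uses a common straight-line (rectified-flow style) convention, which is an assumption here, since the page does not specify the path or time sampling used. `predict_velocity` is a hypothetical stand-in for the action expert.

```python
import numpy as np

def flow_matching_loss(actions, predict_velocity, rng):
    """Flow-matching loss for one batch of action chunks.

    actions: array of shape (batch, horizon, action_dim), the target chunks.
    predict_velocity: hypothetical callable (x_t, t) -> array shaped like x_t.
    rng: a numpy random Generator.
    """
    batch = actions.shape[0]
    noise = rng.standard_normal(actions.shape)
    # One interpolation time per sample, broadcast over horizon and action dims.
    t = rng.uniform(0.0, 1.0, size=(batch, 1, 1))
    # Straight-line path from noise (t=0) to the data chunk (t=1).
    x_t = (1.0 - t) * noise + t * actions
    # The velocity along a straight-line path is constant in t.
    target_velocity = actions - noise
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target_velocity) ** 2))
```

In a real model the predictor would also receive the VLM features and morphology tokens as conditioning; they are omitted here to keep the objective itself visible.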

Camera-Space Actions

Robot end-effector trajectories and reconstructed human pseudo-actions are represented in the observation-centric camera frame.
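As an illustration of what a camera-frame representation involves, the sketch below re-expresses a world-frame end-effector pose in the camera frame using standard rigid-transform algebra. The function and argument names are assumptions, and the page does not describe how camera extrinsics are obtained for each data source.

```python
import numpy as np

def to_camera_frame(T_world_cam, T_world_ee):
    """Express an end-effector pose in the camera frame.

    T_world_cam: 4x4 homogeneous pose of the camera in the world frame.
    T_world_ee: 4x4 homogeneous pose of the end effector in the world frame.
    """
    R = T_world_cam[:3, :3]
    p = T_world_cam[:3, 3]
    # Closed-form inverse of a rigid transform: rotation R^T, translation -R^T p.
    T_cam_world = np.eye(4)
    T_cam_world[:3, :3] = R.T
    T_cam_world[:3, 3] = -R.T @ p
    return T_cam_world @ T_world_ee
```

Because the result depends only on the pose relative to the camera, the same representation can be computed for a robot with known kinematics and for a reconstructed human hand.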

Morphology Conditioning

Robot URDF graph embeddings and learned human surrogate tokens condition the action expert without changing the VLM backbone.

Reliability-Aware Training

Sensor-logged robot actions supervise the primary loss, while noisy human pseudo-actions contribute through quality-weighted auxiliary losses.
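One way to picture this split is a primary term over robot samples plus a down-weighted auxiliary term over human samples, each scaled by a per-sample quality score. The sketch below is an assumption about the general shape of such a loss, not the formulation in the report; `aux_weight`, the quality scores, and the normalization by total weight are all hypothetical choices.

```python
import numpy as np

def mixed_source_loss(robot_err, human_err, human_quality, aux_weight=0.1):
    """Combine a primary robot loss with a quality-weighted human auxiliary loss.

    robot_err: per-sample losses on robot data, shape (n_robot,).
    human_err: per-sample losses on human pseudo-actions, shape (n_human,).
    human_quality: reliability scores in [0, 1], shape (n_human,).
    aux_weight: hypothetical scale of the auxiliary term.
    """
    primary = float(np.mean(robot_err))
    w = np.clip(np.asarray(human_quality, dtype=float), 0.0, 1.0)
    total = float(w.sum())
    # Weighted mean over human samples; zero if every sample is rejected.
    auxiliary = float(np.sum(w * np.asarray(human_err))) / total if total > 0 else 0.0
    return primary + aux_weight * auxiliary
```

With this shape, a human clip scored 0 contributes nothing, and the robot term is never diluted by the number of human samples in the batch.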

Data Construction

ACE-Ego-0 converts raw human egocentric videos into robot-compatible pseudo-action trajectories through curation, reconstruction, action extraction, and filtering.

Figure: ACE-Ego-0 human video data processing pipeline. The five-stage processing pipeline curates human video datasets, selects valid egocentric clips, reconstructs 3D hands, extracts gripper-style actions, and filters noisy pseudo-actions.
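The last two stages can be sketched with simple heuristics. The example below is purely illustrative: it assumes the gripper opening is derived from the thumb-to-index fingertip distance and that noisy clips are rejected by a wrist-speed limit. Neither rule, nor the thresholds `max_aperture` and `max_speed`, comes from the source.

```python
import numpy as np

def gripper_opening(thumb_tip, index_tip, max_aperture=0.10):
    """Map thumb-index fingertip distance (meters) to an opening in [0, 1].

    max_aperture is a hypothetical distance treated as fully open.
    """
    gap = np.asarray(thumb_tip, dtype=float) - np.asarray(index_tip, dtype=float)
    dist = np.linalg.norm(gap, axis=-1)
    return np.clip(dist / max_aperture, 0.0, 1.0)

def keep_clip(wrist_positions, fps, max_speed=2.0):
    """Return True if the reconstructed wrist never exceeds max_speed (m/s).

    wrist_positions: array of shape (frames, 3) in meters.
    max_speed is a hypothetical plausibility threshold.
    """
    steps = np.linalg.norm(np.diff(wrist_positions, axis=0), axis=-1)
    return bool(np.all(steps * fps <= max_speed))
```

A speed check of this kind catches frame-to-frame jumps that typically come from failed hand reconstruction rather than real motion.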

Human Video Coverage

Task-matched egocentric human videos provide complementary action coverage for data-scarce robot fine-tuning tasks.

Figure: Robot and human trajectory coverage. Sweep Cubes trajectory coverage: 419 human-video episodes cover a 4.8x larger workspace area than the 34 robot demonstrations.
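The page does not say how workspace area is measured. As one possible approach, the sketch below estimates the covered area by counting occupied cells on a regular grid over projected 2D trajectory points; the cell size and the projection to 2D are assumptions, and the toy usage in the note is not the data behind the reported ratio.

```python
import numpy as np

def covered_area(xy, cell=0.02):
    """Approximate the area (m^2) visited by 2D points via grid occupancy.

    xy: array of shape (n, 2), trajectory points in meters.
    cell: hypothetical grid cell edge length in meters.
    """
    cells = np.floor(np.asarray(xy, dtype=float) / cell).astype(int)
    # Each distinct occupied cell contributes cell * cell of area.
    occupied = np.unique(cells, axis=0)
    return len(occupied) * cell * cell
```

A coverage ratio is then `covered_area(human_xy) / covered_area(robot_xy)`, with both point sets expressed in the same frame.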

Real-Robot Demonstrations

ACE-Ego-0 is deployed on a bimanual ARX platform with a head-mounted camera and camera-space delta end-effector commands.
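A delta end-effector command can be read as the relative motion between consecutive camera-frame poses. The sketch below assumes translation and rotation deltas are both expressed along the camera axes (rotation applied by left-multiplication); the controller convention actually used on the ARX platform is not stated on this page.

```python
import numpy as np

def delta_command(T_cam_ee_now, T_cam_ee_next):
    """Relative motion from the current to the next camera-frame pose.

    Both inputs are 4x4 homogeneous end-effector poses in the camera frame.
    Returns (d_pos, d_rot): a 3-vector and a 3x3 rotation matrix.
    """
    d_pos = T_cam_ee_next[:3, 3] - T_cam_ee_now[:3, 3]
    # Rotation that maps the current orientation onto the next one.
    d_rot = T_cam_ee_next[:3, :3] @ T_cam_ee_now[:3, :3].T
    return d_pos, d_rot

def apply_delta(T_cam_ee_now, d_pos, d_rot):
    """Apply a delta command to the current pose, returning the target pose."""
    T = np.eye(4)
    T[:3, :3] = d_rot @ T_cam_ee_now[:3, :3]
    T[:3, 3] = T_cam_ee_now[:3, 3] + d_pos
    return T
```

Applying the delta returned by `delta_command` to the current pose recovers the next pose, which is a quick consistency check for whichever convention a controller adopts.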

Scoop Coffee

Category Sorting

Pick Tea / Arrange Drinks

Stack Bowls

Sweep Cubes

Pack Shoes

Results

ACE-Ego-0 improves simulation benchmarks and real bimanual manipulation against strong VLA baselines.

RoboCasa GR1 TableTop

Method	Average Success (%)
GR00T-N1.6	47.6
Qwen3PI	43.9
FLARE	55.0
ABot-M0	58.3
JoyAI-RA	63.2
DIAL	70.2
ACE-Ego-0	72.8

RoboTwin 2.0

Method	Easy (%)	Hard (%)
pi_0.5	82.74	76.76
Motus	88.66	87.02
LingBot-VLA	88.56	86.68
ABot-M0	86.06	85.08
JoyAI-RA	90.48	89.28
ACE-Ego-0	91.12	90.62

Figure: Real-robot success rates on six ARX bimanual tasks, each evaluated over 30 trials, together with their average.

Citation

@misc{li2026aceego0unifyingegocentrichuman,
      title={ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining}, 
      author={Hao Li and Ganlong Zhao and Yufei Liu and Haotian Hou and Guoquan Ye and Tongyan Fang and Chunxiao Liu and Siyuan Huang and Jianbo Liu and Xiaogang Wang and Hongsheng Li},
      year={2026},
      eprint={2606.17200},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.17200}, 
}