Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

HABC fine-tunes vision-language-action policies with hierarchical advantage weighting, turning sparse rollout outcomes into transition-level supervision for real-world robot manipulation.

Tongyan Fang1,2 Siyuan Huang1† Naiyu Fang1,3 Ganlong Zhao1,3 Zhongjin Luo1,3 Jianbo Liu1 Xiaogang Wang1 Ying Dong2 ✉ Hongsheng Li1,3 ✉

1ACE Robotics    2Shenzhen International Graduate School, Tsinghua University
3The Chinese University of Hong Kong
† Project leader    ✉ Corresponding authors

Abstract

When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which raises two problems. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for the two objectives on different data subsets. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success rates from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38% on Paper Bag, Pencil Pouch, and Snack Bag, respectively.
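
As a rough illustration of the intervention-aware credit assignment described in the abstract, the sketch below labels one episode. The abstract states only that outcome labels are restricted to segments executed by the current policy and must not leak across intervention boundaries. Everything else here is an assumption made for illustration: the function name `assign_outcome_labels`, the per-step `is_intervention` flags, the choice to label only the trailing autonomous segment with the episode outcome, and the use of `None` for unlabeled transitions.

```python
from typing import List, Optional


def assign_outcome_labels(is_intervention: List[bool],
                          success: bool) -> List[Optional[int]]:
    """Attach the episode outcome only to the trailing policy-executed segment.

    Hypothetical helper: walks backward from the last transition and stops at
    the first intervention step, so the binary outcome never crosses an
    intervention boundary. Unlabeled transitions are marked None.
    """
    labels: List[Optional[int]] = [None] * len(is_intervention)
    outcome = 1 if success else 0
    for t in range(len(is_intervention) - 1, -1, -1):
        if is_intervention[t]:
            break  # boundary reached: earlier steps receive no outcome label
        labels[t] = outcome
    return labels


# Two autonomous steps, two intervention steps, then two autonomous steps.
flags = [False, False, True, True, False, False]
print(assign_outcome_labels(flags, success=True))
```

In this sketch, autonomous segments that precede an intervention stay unlabeled; how such segments are actually treated is not specified on this page.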

Consecutive Task Successes

These videos show consecutive rollouts from the final HABC policy on three contact-rich bimanual manipulation tasks. The evaluation order is Pencil Pouch, Paper Bag, then Snack Bag; HABC achieves 6/6, 6/6, and 4/4 consecutive successes, respectively.

1. Pencil Pouch: 6/6
2. Paper Bag: 6/6
3. Snack Bag: 4/4

SFT Failure vs. HABC Recovery

The SFT baseline often fails after entering off-demonstration states. After online fine-tuning, HABC learns recovery-aware actions from sparse outcomes and intervention data, allowing the policy to correct these failure modes and complete the task.

Pencil Pouch
SFT baseline: Failure
HABC: Recovery
Paper Bag
SFT baseline: Failure
HABC: Recovery
Snack Bag
SFT baseline: Failure
HABC: Recovery

Method Overview

HABC treats a sparse episode outcome as two different learning signals. The viability head estimates whether a state can still lead to success, while the efficiency head estimates progress toward faster completion once success is reachable. A state-adaptive gate combines their one-step advantages into a bounded per-transition weight for the VLA flow-matching loss.

Overview of Hierarchical Advantage-Weighted Behavior Cloning
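
The gating step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the page says only that a state-adaptive gate favors viability when success is uncertain, shifts to efficiency when viability is high, and yields a bounded per-transition weight. The sigmoid gate, the exponential weight mapping, and the parameters `threshold`, `sharpness`, `temperature`, `w_min`, and `w_max` are all hypothetical choices.

```python
import math


def gated_weight(v_viab: float, adv_viab: float, adv_eff: float,
                 threshold: float = 0.8, sharpness: float = 10.0,
                 temperature: float = 1.0,
                 w_min: float = 0.1, w_max: float = 5.0):
    """Return (gate, weight) for one transition.

    v_viab   : viability estimate for the state, assumed to lie in [0, 1]
    adv_viab : one-step advantage from the viability head
    adv_eff  : one-step advantage from the efficiency head
    All default parameter values are illustrative assumptions.
    """
    # Assumed gate: close to 0 when viability is low, close to 1 when high.
    gate = 1.0 / (1.0 + math.exp(-sharpness * (v_viab - threshold)))
    # Blend the two advantages; viability dominates while the gate is small.
    combined = (1.0 - gate) * adv_viab + gate * adv_eff
    # Bound the weight by clamping the exponent, which also avoids overflow.
    exponent = min(max(combined / temperature, math.log(w_min)), math.log(w_max))
    return gate, math.exp(exponent)


def weighted_loss(per_transition_losses, weights):
    """Weighted mean of per-transition actor losses (e.g. flow-matching losses)."""
    total = sum(w * l for w, l in zip(weights, per_transition_losses))
    return total / len(per_transition_losses)


# Low-viability state: the weight follows the viability advantage.
print(gated_weight(v_viab=0.2, adv_viab=0.5, adv_eff=-1.0))
# High-viability state: the weight follows mostly the efficiency advantage.
print(gated_weight(v_viab=0.95, adv_viab=0.0, adv_eff=0.3))
```

Clamping keeps any single transition from dominating the actor update, which is one simple way to obtain the bounded weight the overview mentions.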

Real-Robot Results

The evaluation covers three contact-rich bimanual manipulation tasks with deformable objects. Compared with supervised fine-tuning, HABC substantially improves task success by learning recovery-aware behavior from online rollouts and human interventions.

36% → 92% Paper Bag success rate
44% → 88% Pencil Pouch success rate
12% → 38% Snack Bag success rate
Main results across three tasks

Citation

@misc{fang2026habc,
      title={Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes}, 
      author={Tongyan Fang and Siyuan Huang and Naiyu Fang and Ganlong Zhao and Zhongjin Luo and Jianbo Liu and Xiaogang Wang and Ying Dong and Hongsheng Li},
      year={2026},
      eprint={2606.17043},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.17043}, 
}