Abstract: Despite recent advances in vision-language-action (VLA) models across a variety of robotics tasks, these models suffer from critical issues such as poor generalizability to unseen tasks, owing to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, which introduces distribution bias and limits their adaptability to diverse manipulation objectives such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs at the trajectory level and implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks complex manipulation tasks down into independent stages and automatically guides preference modeling through customized spatiotemporal constraints, with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively. All code, models, and data are available at https://grape-vla.github.io/

Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

GRAPE: Generalizing Robot Policy via Preference Alignment

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own

Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning

Accelerated Robot Learning via Human Brain Signals

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

A Study on Dense and Sparse (Visual) Rewards in Robot Policy Learning

RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

PEARL: Zero-shot Cross-task Preference Alignment and Robust Reward Learning for Robotic Manipulation

Data-efficient Deep Reinforcement Learning Method Toward Scaling Continuous Robotic Task with Sparse Rewards

Learning a Universal Human Prior for Dexterous Manipulation from Human Preference

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Affordance-Guided Reinforcement Learning via Visual Prompting

ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics

End-to-End Robotic Reinforcement Learning without Reward Engineering

Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention

LIRE: listwise reward enhancement for preference alignment