Abstract: Despite the recent advancements of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failure trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks to independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively. All code, models, and data are available at https://grape-vla.github.io/

What problem does this paper attempt to address?

The paper addresses the limited generalization ability of current vision-language-action (VLA) models in robotic tasks. Although existing VLA models perform well on a variety of robotic tasks, they generalize poorly to unseen tasks, environments, objects, and semantic contexts. The main cause is their reliance on behavior cloning: they learn by imitating successful trajectories only, without developing an overall understanding of task goals or an awareness of potential failure modes. In addition, these models are usually fine-tuned on demonstration data collected by experts under different settings, which introduces distribution bias and limits their adaptability to diverse manipulation objectives (such as efficiency, safety, and task completion).

To solve these problems, the paper proposes GRAPE (Generalizing Robot Policy via Preference Alignment), a method that improves the generalization ability of VLA models through preference alignment. GRAPE does this in three ways:

1. **Trajectory-level preference alignment**: GRAPE aligns the VLA at the trajectory level and implicitly models reward from both successful and failed trials, which improves generalization across tasks.
2. **Multi-stage decomposition**: GRAPE decomposes complex manipulation tasks into independent stages and automatically guides preference modeling through spatiotemporal constraints, using keypoints proposed by a large vision-language model.
3. **Flexible goal alignment**: These constraints are flexible and can be customized for different objectives (such as safety, efficiency, or task success).
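The stage-wise, customizable constraints described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the trajectory representation (a list of stages, each a list of 2D positions), the two cost functions `collision_cost` and `length_cost`, the weight names, and the rule of pairing the best-scoring trajectory with the worst-scoring one.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed representation: a trajectory is a list of stages,
# and each stage is a list of 2D end-effector positions.
Point = Tuple[float, float]
Stage = List[Point]
Trajectory = List[Stage]


def collision_cost(stage: Stage, obstacle: Point, radius: float = 0.1) -> float:
    """Hypothetical safety cost: fraction of steps inside the obstacle radius."""
    if not stage:
        return 0.0
    hits = sum(
        1
        for (x, y) in stage
        if (x - obstacle[0]) ** 2 + (y - obstacle[1]) ** 2 < radius ** 2
    )
    return hits / len(stage)


def length_cost(stage: Stage, max_steps: int = 50) -> float:
    """Hypothetical efficiency cost: step count, normalized and capped at 1."""
    return min(len(stage) / max_steps, 1.0)


def trajectory_score(
    traj: Trajectory,
    weights: Dict[str, float],
    obstacle: Point,
    success: bool,
) -> float:
    """Higher is better: weighted success bonus minus per-stage costs."""
    total_cost = 0.0
    for stage in traj:
        total_cost += weights["safety"] * collision_cost(stage, obstacle)
        total_cost += weights["efficiency"] * length_cost(stage)
    return weights["success"] * (1.0 if success else 0.0) - total_cost


def pick_preference_pair(scores: Sequence[float]) -> Tuple[int, int]:
    """Return (preferred, rejected) indices: the best and worst scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[-1], order[0]
```

In this sketch, changing the `weights` dictionary is what shifts the ranking toward safety, efficiency, or task success; for example, raising `weights["safety"]` penalizes trajectories that pass close to the obstacle more heavily.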
Through these methods, GRAPE improves the success rate of VLA models on new tasks, and it can reduce the collision rate and the number of steps per rollout while maintaining a high success rate, which improves the efficiency and safety of robot operation. Experimental results show that, compared with existing state-of-the-art VLA models, GRAPE increases the success rate by 51.79% on in-domain tasks and by 60.36% on unseen tasks. When aligned for safety it reduces the collision rate by 44.31%, and when aligned for efficiency it reduces the rollout step length by 11.15%.
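The summary above does not state the training objective, so the following is only a sketch of what a trajectory-level preference loss could look like, assuming a DPO-style formulation with a frozen reference policy. The function names, the `beta` parameter, and the use of summed per-step action log-probabilities as the trajectory log-probability are all assumptions for illustration.

```python
import math
from typing import Sequence


def trajectory_log_prob(step_log_probs: Sequence[float]) -> float:
    """Assumed trajectory log-probability: sum of per-step action log-probs."""
    return sum(step_log_probs)


def preference_loss(
    policy_preferred: Sequence[float],
    policy_rejected: Sequence[float],
    ref_preferred: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,
) -> float:
    """DPO-style loss on one (preferred, rejected) trajectory pair.

    Each argument holds per-step action log-probabilities under either the
    policy being trained or the frozen reference policy.
    """
    preferred_gain = trajectory_log_prob(policy_preferred) - trajectory_log_prob(
        ref_preferred
    )
    rejected_gain = trajectory_log_prob(policy_rejected) - trajectory_log_prob(
        ref_rejected
    )
    margin = beta * (preferred_gain - rejected_gain)
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the preferred trajectory and away from the rejected one, which is how failed trials contribute a learning signal in this kind of objective.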

GRAPE: Generalizing Robot Policy via Preference Alignment

Ensemble Bootstrapped Deep Deterministic Policy Gradient For Vision-Based Robotic Grasping

Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment

Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance

OpenVLA: An Open-Source Vision-Language-Action Model

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

PEARL: Zero-shot Cross-task Preference Alignment and Robust Reward Learning for Robotic Manipulation

GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy

Spatial-Language Attention Policies for Efficient Robot Learning

Leveraging the Efficiency of Multi-Task Robot Manipulation Via Task-Evoked Planner and Reinforcement Learning

CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation

VL-Grasp: a 6-Dof Interactive Grasp Policy for Language-Oriented Objects in Cluttered Indoor Scenes

Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study

MoVie: Visual Model-Based Policy Adaptation for View Generalization

Task Success is not Enough: Investigating the Use of Video-Language Models as Behavior Critics for Catching Undesirable Agent Behaviors

NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation

Affordance-Guided Reinforcement Learning via Visual Prompting

Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots

Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression