Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation

Xuanlin Li, Tong Zhao, Xinghao Zhu, Jiuguang Wang, Tao Pang, Kuan Fang
2024-12-04
Abstract: Contact-rich bimanual manipulation involves precise coordination of two arms to change object states through strategically selected contacts and motions. Due to the inherent complexity of these tasks, acquiring sufficient demonstration data and training policies that generalize to unseen scenarios remain a largely unresolved challenge. Building on recent advances in planning through contacts, we introduce Generalizable Planning-Guided Diffusion Policy Learning (GLIDE), an approach that effectively learns to solve contact-rich bimanual manipulation tasks by leveraging model-based motion planners to generate demonstration data in high-fidelity physics simulation. Through efficient planning in randomized environments, our approach generates large-scale and high-quality synthetic motion trajectories for tasks involving diverse objects and transformations. We then train a task-conditioned diffusion policy via behavior cloning using these demonstrations. To tackle the sim-to-real gap, we propose a set of essential design options in feature extraction, task representation, action prediction, and data augmentation that enable learning robust prediction of smooth action sequences and generalization to unseen scenarios. Through experiments in both simulation and the real world, we demonstrate that our approach can enable a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties. Website: https://glide-manip.github.io/
Robotics, Computer Vision and Pattern Recognition, Machine Learning
### What problem does this paper attempt to address?

This paper addresses **the challenge of complex object manipulation by bimanual (dual-arm) robots in contact-rich settings**. Specifically, it focuses on enabling a robot to change the state of an object by coordinating two arms through multiple contact points, especially for objects with diverse geometries and physical properties. Because these tasks are inherently complex (for example, they require long-horizon, multi-stage contact and manipulation), acquiring sufficient demonstration data and training policies that generalize to unseen scenarios remain largely unresolved.

#### Main problems include:

1. **Obtaining high-quality demonstration data**: For complex dual-arm manipulation tasks, collecting expert demonstrations in the real world is both difficult and expensive.
2. **The sim-to-real gap**: Policies trained on simulation data face differences in perception and dynamics when deployed in the real world.
3. **Generalization**: The learned policies should apply to unseen objects and environments, not just the specific objects in the training set.

To solve these problems, the paper proposes **Generalizable Planning-Guided Diffusion Policy Learning (GLIDE)**. The core idea of GLIDE is to use a model-based motion planner to generate large-scale, high-quality synthetic trajectory data in high-fidelity physics simulation, and then to train a task-conditioned diffusion policy via behavior cloning so that it predicts a smooth sequence of actions from the observed point cloud and the task description. GLIDE also introduces a set of design choices that improve the policy's transfer from simulation to the real world and its generalization to unseen scenarios.
### Formula summary

- **Objective function**:
  \[ \min_{q_u^+, a} \; (q_u^+ - q_u^{\text{goal}})^T Q (q_u^+ - q_u^{\text{goal}}) + (a - q_a)^T R (a - q_a) \]
  where \( q_u^+ = f_{\text{local}}(q_u, q_a, a) \) is the approximate configuration of the object after the robot executes action \( a \), and \( Q \) and \( R \) are user-specified cost matrices.
- **Action sequence prediction**:
  \[ a_{t+1:t+T_a} = \{q_i - q_t\}_{i=t+1}^{t+T_a} \]
  where \( T_a \) is the number of predicted time steps and \( q_t \) is the current joint position, so each predicted action is a future joint position expressed relative to \( q_t \).

### Conclusion

By combining efficient motion planning with diffusion policy learning, GLIDE addresses several challenges in contact-rich dual-arm manipulation and demonstrates its effectiveness and generalization in both simulation and the real world.
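
As a minimal sketch of the two expressions in the formula summary, the following Python evaluates the quadratic planning cost and builds the relative action sequence for small hand-picked vectors. The function names (`quadratic_form`, `planning_cost`, `relative_actions`) and all numeric values are hypothetical and chosen only for illustration. The local contact model \( f_{\text{local}} \) is not reproduced: the sketch simply takes its output \( q_u^+ \) as an input, and it assumes \( q_a \) denotes the robot's current configuration. It evaluates the cost for one candidate action and does not perform the minimization.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def quadratic_form(x: Vector, M: Matrix) -> float:
    """Return x^T M x for a vector x and a square matrix M."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def planning_cost(q_u_next: Vector, q_u_goal: Vector,
                  a: Vector, q_a: Vector,
                  Q: Matrix, R: Matrix) -> float:
    """Evaluate (q_u+ - q_u_goal)^T Q (q_u+ - q_u_goal) + (a - q_a)^T R (a - q_a).

    q_u_next stands in for f_local(q_u, q_a, a); the contact model itself
    is assumed to be computed elsewhere and is not part of this sketch.
    """
    object_error = [p - g for p, g in zip(q_u_next, q_u_goal)]
    action_offset = [ai - qi for ai, qi in zip(a, q_a)]
    return quadratic_form(object_error, Q) + quadratic_form(action_offset, R)


def relative_actions(q_future: Sequence[Vector], q_t: Vector) -> List[List[float]]:
    """Return {q_i - q_t} for each future joint configuration q_i."""
    return [[qi - qt for qi, qt in zip(q_i, q_t)] for q_i in q_future]


# Illustrative values only (not taken from the paper).
Q = [[1.0, 0.0], [0.0, 1.0]]   # object-state cost weights
R = [[2.0]]                    # action-regularization weight
cost = planning_cost(q_u_next=[1.0, 2.0], q_u_goal=[0.0, 0.0],
                     a=[0.5], q_a=[0.0], Q=Q, R=R)

# Two future joint configurations (T_a = 2) relative to the current one.
deltas = relative_actions(q_future=[[1.0, 2.0], [1.5, 2.5]], q_t=[1.0, 2.0])
```

Here the object term contributes 1 + 4 = 5 and the action term contributes 2 × 0.25 = 0.5, so `cost` is 5.5. `deltas` holds the offsets `[0.0, 0.0]` and `[0.5, 0.5]`, mirroring how each predicted action is defined relative to the current joint position.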