Visual Imitation Learning of Non-Prehensile Manipulation Tasks with Dynamics-Supervised Models

Abdullah Mustafa, Ryo Hanai, Ixchel Ramirez, Floris Erich, Ryoichi Nakajo, Yukiyasu Domae, Tetsuya Ogata
2024-10-25
Abstract: Unlike quasi-static robotic manipulation tasks like pick-and-place, dynamic tasks such as non-prehensile manipulation pose greater challenges, especially for vision-based control. Successful control requires the extraction of features relevant to the target task. In visual imitation learning settings, these features can be learnt by backpropagating the policy loss through the vision backbone. Yet, this approach tends to learn task-specific features with limited generalizability. Alternatively, learning world models can realize more generalizable vision backbones. Utilizing the learnt features, task-specific policies are subsequently trained. Commonly, these models are trained solely to predict the next RGB state from the current state and the action taken. However, RGB-only prediction might not fully capture the task-relevant dynamics. In this work, we hypothesize that direct supervision of target dynamic states (Dynamics Mapping) can learn better dynamics-informed world models. Besides next-RGB reconstruction, the world model is also trained to directly predict the position, velocity, and acceleration of the rigid bodies in the environment. To verify our hypothesis, we designed a non-prehensile 2D environment tailored to two tasks: "Balance-Reaching" and "Bin-Dropping". When trained on the first task, dynamics mapping enhanced the task performance under different training configurations (Decoupled, Joint, End-to-End) and policy architectures (Feedforward, Recurrent). Notably, its most significant impact was on world model pretraining, boosting the success rate from 21% to 85%. Frozen dynamics-informed world models could generalize well to a task with in-domain dynamics, but poorly to one with out-of-domain dynamics.
Robotics
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to handle dynamic non-prehensile manipulation tasks more effectively in visual imitation learning. Specifically, the authors focus on tasks that require accounting for the dynamics of objects, such as balancing and moving objects without grasping them. These tasks are especially challenging for vision-based control because successful control requires extracting task-relevant features from visual input.

### Main problems

1. **Limitations of existing methods**:
   - In visual imitation learning, current methods tend to learn task-specific features by backpropagating the policy loss through the vision backbone, which limits generalization.
   - Predicting only the next RGB frame may not fully capture the task-relevant dynamics.
2. **Proposed new method**:
   - The authors propose "Dynamics Mapping": when training the world model, they directly supervise the dynamic states of the environment's rigid bodies (position, velocity, and acceleration) in addition to next-RGB reconstruction, so that the learnt features are dynamics-informed.
   - The aim is for the model to better capture and predict the dynamics of the environment, thereby improving task performance and generalization.

### Verification of the hypothesis

To verify this hypothesis, the authors designed a two-dimensional non-prehensile environment and conducted experiments on two tasks:

- **Balance-Reaching**: Make the cart reach the target position without dropping the object.
- **Bin-Dropping**: Tilt the object and put it into the green bin.

### Experimental results

- **Significant pretraining effect**: Dynamics Mapping had its largest impact on world model pretraining, raising the success rate from 21% (RGB prediction only) to 85%.
- **Performance under different configurations**: Dynamics Mapping improves task performance under different training configurations (Decoupled, Joint, End-to-End) and policy architectures (Feedforward, Recurrent).
- **Generalization ability**: The frozen dynamics-informed world model generalizes well to a task with in-domain dynamics, but poorly to a task with out-of-domain dynamics.

### Summary

The main contribution of this paper is a new training method, Dynamics Mapping, that strengthens the learning and use of dynamics features in visual imitation learning. The method improves task performance and yields world models that transfer to tasks with similar dynamics, though not to tasks whose dynamics fall outside the training domain.
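
The Dynamics Mapping idea summarized above (next-RGB reconstruction plus direct supervision of position, velocity, and acceleration) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the mean-squared-error form of both terms, the `state_weight` factor, the flattening of images and states into plain lists, and the use of finite differences to obtain velocity and acceleration targets (a simulator may well expose these directly).

```python
from typing import List, Sequence


def central_difference(values: Sequence[float], dt: float) -> List[float]:
    """Differentiate a sampled signal: central differences in the
    interior, one-sided differences at the two end points."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append((values[hi] - values[lo]) / ((hi - lo) * dt))
    return out


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    if len(pred) != len(target):
        raise ValueError("length mismatch")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dynamics_mapping_loss(
    pred_rgb: Sequence[float],
    true_rgb: Sequence[float],
    pred_state: Sequence[float],
    true_state: Sequence[float],
    state_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Hypothetical combined objective: next-frame RGB reconstruction
    error plus a weighted error on the predicted dynamic states
    (position, velocity, acceleration concatenated into one vector)."""
    rgb_loss = mse(pred_rgb, true_rgb)
    state_loss = mse(pred_state, true_state)
    return rgb_loss + state_weight * state_loss


# Build velocity and acceleration targets for one coordinate of one body
# from a position trajectory x(t) = t^2 sampled every 0.1 s.
dt = 0.1
positions = [(dt * i) ** 2 for i in range(10)]
velocities = central_difference(positions, dt)
accelerations = central_difference(velocities, dt)
```

With an RGB-only world model the second term is absent, so nothing forces the latent features to encode velocity or acceleration explicitly; the extra term is one simple way to express the supervision the summary describes.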