The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning

Carmelo Sferrazza, Younggyo Seo, Hao Liu, Youngwoon Lee, Pieter Abbeel
2023-11-02
Abstract: Humans rely on the synergy of their senses for most essential tasks. For tasks requiring object manipulation, we seamlessly and effectively exploit the complementarity of our senses of vision and touch. This paper draws inspiration from such capabilities and aims to find a systematic approach to fuse visual and tactile information in a reinforcement learning setting. We propose Masked Multimodal Learning (M3L), which jointly learns a policy and visual-tactile representations based on masked autoencoding. The representations jointly learned from vision and touch improve sample efficiency, and unlock generalization capabilities beyond those achievable through each of the senses separately. Remarkably, representations learned in a multimodal setting also benefit vision-only policies at test time. We evaluate M3L on three simulated environments with both visual and tactile observations: robotic insertion, door opening, and dexterous in-hand manipulation, demonstrating the benefits of learning a multimodal policy. Code and videos of the experiments are available at https://sferrazza.cc/m3l_site.
Robotics, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to integrate visual and tactile information in robotic manipulation tasks so as to improve sample efficiency and generalization. Specifically, the authors propose Masked Multimodal Learning (M3L), which jointly learns a visual-tactile representation with a multimodal masked autoencoder (MAE). The approach aims to overcome the limitations of existing methods that handle visual or tactile information separately, and thereby to perform better in complex environments.

### Background of the Paper

- **Human Sensory Synergy**: Humans seamlessly exploit the complementarity of vision and touch when performing tasks that require object manipulation. For example, the initial stage of grasping an object relies primarily on vision, while tactile feedback becomes more important after contact.
- **Current State of Robotic Manipulation**: In existing robotic manipulation research, vision and touch are usually studied independently. Although vision-based methods have made significant progress, incorporating tactile feedback can further enhance a robot's manipulation capabilities, especially when dealing with visual occlusion, handling fragile objects, and improving precision.

### Proposed Method

- **Masked Multimodal Learning (M3L)**: M3L jointly learns visual and tactile representations through a multimodal MAE and trains a reinforcement learning policy on top of them. The specific steps are:
  - **Representation Learning**: An MAE learns compact representations from raw visual and tactile data by optimizing a reconstruction loss.
  - **Policy Learning**: The Proximal Policy Optimization (PPO) algorithm trains a policy on the learned multimodal representations.

### Experimental Setup

- **Tactile Insertion**: The robot needs to insert a pin into a target frame. The experiment includes 18 different training pin shapes and 2 unseen test pin shapes.
- **Door Opening Task**: The robot needs to open a locked door by turning the doorknob and pulling the door. The position and friction coefficient of the door are randomly initialized in the experiment.
- **In-Hand Cube Rotation**: The robot needs to reposition a colored cube to a predefined configuration. The experiment introduces slight perturbations in the cube's mass and the camera position.

### Experimental Results

- **Generalization Capability**: M3L generalizes better to unseen objects and scene changes, particularly in the tactile insertion and in-hand cube rotation tasks.
- **Sample Efficiency**: M3L shows higher sample efficiency in all tasks, especially in the door opening task.
- **Improvement in Vision-Only Policies**: Even when only visual information is available at test time, the representation encoder trained by M3L improves the performance of vision-only policies.

### Conclusion

This paper shows how M3L integrates visual and tactile information in robotic manipulation tasks, improving sample efficiency and generalization. The method performs well in multimodal settings and also maintains good performance when only visual information is available, which opens new possibilities for practical applications.
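
The representation-learning step described under Proposed Method can be illustrated with a minimal NumPy sketch of masked reconstruction over visual and tactile tokens. It is not the authors' implementation. The input sizes, the patch size, the 0.75 mask ratio, the choice to sample one mask jointly over both modalities, and the helper names `patchify`, `random_mask`, and `masked_reconstruction_loss` are all assumptions made for illustration. The "decoder" is a placeholder that predicts the mean of the visible tokens, standing in for a learned encoder-decoder.

```python
import numpy as np


def patchify(img, patch):
    """Split an (H, W, C) array into flattened, non-overlapping patches."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def random_mask(num_tokens, mask_ratio, rng):
    """Boolean array of length num_tokens; True marks a hidden token."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask


def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error computed over the masked tokens only."""
    per_token = ((prediction - target) ** 2).mean(axis=1)
    return float(per_token[mask].mean())


rng = np.random.default_rng(0)

# Hypothetical observations: a camera frame and a tactile reading
# treated as a small image. Sizes are assumptions, not from the paper.
image = rng.random((64, 64, 3))
tactile = rng.random((32, 32, 3))

# Turn each modality into tokens, then concatenate into one sequence.
vision_tokens = patchify(image, patch=8)     # shape (64, 192)
tactile_tokens = patchify(tactile, patch=8)  # shape (16, 192)
tokens = np.concatenate([vision_tokens, tactile_tokens], axis=0)

# Assumption: a single random mask is sampled over the joint sequence.
mask = random_mask(len(tokens), mask_ratio=0.75, rng=rng)
visible_tokens = tokens[~mask]  # the only tokens an encoder would see

# Placeholder "decoder": predict the mean visible token everywhere.
prediction = np.broadcast_to(visible_tokens.mean(axis=0), tokens.shape)
loss = masked_reconstruction_loss(tokens, prediction, mask)
print(f"visible tokens: {len(visible_tokens)}, masked loss: {loss:.4f}")
```

In a full system, the placeholder prediction would be replaced by a trained encoder and decoder, and the encoder's output would serve as the observation embedding for the PPO policy.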