Abstract: Humans rely on the synergy of their senses for most essential tasks. For tasks requiring object manipulation, we seamlessly and effectively exploit the complementarity of our senses of vision and touch. This paper draws inspiration from such capabilities and aims to find a systematic approach to fuse visual and tactile information in a reinforcement learning setting. We propose Masked Multimodal Learning (M3L), which jointly learns a policy and visual-tactile representations based on masked autoencoding. The representations jointly learned from vision and touch improve sample efficiency, and unlock generalization capabilities beyond those achievable through each of the senses separately. Remarkably, representations learned in a multimodal setting also benefit vision-only policies at test time. We evaluate M3L on three simulated environments with both visual and tactile observations: robotic insertion, door opening, and dexterous in-hand manipulation, demonstrating the benefits of learning a multimodal policy. Code and videos of the experiments are available at https://sferrazza.cc/m3l_site.

What problem does this paper attempt to address?

This paper addresses how to fuse visual and tactile information in robotic manipulation so as to improve sample efficiency and generalization. The authors propose Masked Multimodal Learning (M3L), which jointly learns a policy and visual-tactile representations using a multimodal masked autoencoder (MAE). The approach aims to overcome the limitations of existing methods that handle visual or tactile information separately.

### Background of the Paper

- **Human Sensory Synergy**: Humans seamlessly exploit the complementarity of vision and touch when manipulating objects. For example, the initial stage of grasping an object relies primarily on vision, while tactile feedback becomes more important after contact.
- **Current State of Robotic Manipulation**: In existing robotic manipulation research, vision and touch are usually studied independently. Although vision-based methods have made significant progress, incorporating tactile feedback can further enhance a robot's manipulation capabilities, especially when dealing with visual occlusion, handling fragile objects, and improving precision.

### Proposed Method

- **Masked Multimodal Learning (M3L)**: M3L jointly learns visual and tactile representations with a multimodal masked autoencoder (MAE) and trains a reinforcement learning policy on top of them. It has two components:
  - **Representation Learning**: An MAE learns compact representations from raw visual and tactile data by optimizing a reconstruction loss.
  - **Policy Learning**: The Proximal Policy Optimization (PPO) algorithm trains the policy on the learned multimodal representations.

### Experimental Setup

- **Tactile Insertion**: The robot needs to insert a pin into a target frame. The experiment includes 18 different training pin shapes and 2 unseen test pin shapes.
- **Door Opening**: The robot needs to open a locked door by turning the doorknob and pulling the door. The position and friction coefficient of the door are randomly initialized.
- **In-Hand Cube Rotation**: The robot needs to reposition a colored cube to a predefined configuration. The experiment introduces slight perturbations in the cube's mass and the camera position.

### Experimental Results

- **Generalization Capability**: M3L generalizes better to unseen objects and scene changes, particularly in the tactile insertion and in-hand cube rotation tasks.
- **Sample Efficiency**: M3L is more sample efficient in all tasks, especially in the door opening task.
- **Improvement in Vision-Only Policies**: Even when only visual information is available at test time, the representation encoder trained with M3L improves the performance of vision-only policies.

### Conclusion

This paper shows that M3L can effectively integrate visual and tactile information in robotic manipulation tasks, improving sample efficiency and generalization. The method performs well in multimodal settings and also retains good performance when only visual information is available at test time.
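
The representation learning step described under Proposed Method can be illustrated with a minimal sketch of masked autoencoding over a joint visual-tactile token sequence. This is not the authors' implementation: the function names `split_masked_tokens` and `masked_mse`, the token counts, and the 0.75 mask ratio are all hypothetical choices for illustration. Computing the reconstruction loss on masked tokens only follows common MAE practice and is assumed here rather than taken from the paper.

```python
import random


def split_masked_tokens(num_visual, num_tactile, mask_ratio, seed=0):
    """Randomly mask a fraction of the joint visual + tactile token sequence.

    Indices [0, num_visual) refer to visual tokens, the rest to tactile tokens.
    Returns (kept, masked) as sorted index lists into the joint sequence.
    """
    total = num_visual + num_tactile
    num_masked = int(total * mask_ratio)
    rng = random.Random(seed)
    order = list(range(total))
    rng.shuffle(order)  # one shared shuffle, so masking spans both modalities
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked


def masked_mse(targets, predictions, masked):
    """Mean squared reconstruction error over the masked tokens only."""
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(targets[i], predictions[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical sizes: 16 visual tokens and 8 tactile tokens, 75% masked.
kept, masked = split_masked_tokens(num_visual=16, num_tactile=8, mask_ratio=0.75)
```

In this sketch the encoder would see only the `kept` tokens, and a decoder would be trained to reconstruct the `masked` ones. Because visual and tactile tokens share one sequence, a masked token from one modality may have to be reconstructed from visible tokens of the other.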

The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning

Masked Visual-Tactile Pre-training for Robot Manipulation

Multimodal Masked Autoencoders Learn Transferable Representations

MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation

Multimodal and Force-Matched Imitation Learning with a See-Through Visuotactile Sensor

Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training

See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation

Bridging vision and touch: advancing robotic interaction prediction with self-supervised multimodal learning

Exemplar Masking for Multimodal Incremental Learning

Learning Precise, Contact-Rich Manipulation through Uncalibrated Tactile Skins

An Efficient Generalizable Framework for Visuomotor Policies via Control-aware Augmentation and Privilege-guided Distillation

M2CURL: Sample-Efficient Multimodal Reinforcement Learning via Self-Supervised Representation Learning for Robotic Manipulation

How to Sense the World: Leveraging Hierarchy in Multimodal Perception for Robust Reinforcement Learning Agents

Visual-Tactile Multimodality for Following Deformable Linear Objects Using Reinforcement Learning

Learning Self-Supervised Representations from Vision and Touch for Active Sliding Perception of Deformable Surfaces

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

Learning with Unmasked Tokens Drives Stronger Vision Learners

Combining Vision and Tactile Sensation for Video Prediction

Multi-Modal Fusion in Contact-Rich Precise Tasks via Hierarchical Policy Learning

MAMO: Fine-Grained Vision-Language Representations Learning with Masked Multimodal Modeling