M2CURL: Sample-Efficient Multimodal Reinforcement Learning via Self-Supervised Representation Learning for Robotic Manipulation

Fotios Lygerakis, Vedant Dave, Elmar Rueckert
2024-06-19
Abstract: One of the most critical aspects of multimodal Reinforcement Learning (RL) is the effective integration of different observation modalities. Having robust and accurate representations derived from these modalities is key to enhancing the robustness and sample efficiency of RL algorithms. However, learning representations in RL settings for visuotactile data poses significant challenges, particularly due to the high dimensionality of the data and the complexity involved in correlating visual and tactile inputs with the dynamic environment and task objectives. To address these challenges, we propose Multimodal Contrastive Unsupervised Reinforcement Learning (M2CURL). Our approach employs a novel multimodal self-supervised learning technique that learns efficient representations and contributes to faster convergence of RL algorithms. Our method is agnostic to the RL algorithm, thus enabling its integration with any available RL algorithm. We evaluate M2CURL on the Tactile Gym 2 simulator and we show that it significantly enhances the learning efficiency in different manipulation tasks. This is evidenced by faster convergence rates and higher cumulative rewards per episode, compared to standard RL algorithms without our representation learning approach.
Robotics, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the problem of effectively integrating different observation modalities (such as visual and tactile data) in Multimodal Reinforcement Learning (MRL) to improve learning efficiency and sample efficiency of the algorithms. Specifically, the paper focuses on the following aspects:

1. **Representation of High-Dimensional Data**: Visual and tactile data usually have high dimensions and complexity, making it challenging to extract effective feature representations from these data.
2. **Association Between Modalities**: Another challenge is how to associate visual and tactile inputs with task goals in dynamic environments.
3. **Sample Efficiency**: Traditional reinforcement learning algorithms often require a large number of samples to converge when dealing with high-dimensional multimodal data, which limits their effectiveness in practical applications.

To address these issues, the paper proposes a Multimodal Contrastive Unsupervised Reinforcement Learning method (M2CURL). M2CURL learns efficient multimodal representations through self-supervised learning techniques and integrates them into existing reinforcement learning algorithms, thereby accelerating the convergence speed of the algorithms and improving learning efficiency.

### Main Contributions

1. **Multimodal Representation Learning**: M2CURL utilizes contrastive learning techniques to learn efficient multimodal representations by computing intra-modal and inter-modal losses.
2. **Algorithm Agnosticism**: M2CURL can be seamlessly integrated into any existing reinforcement learning algorithm, enhancing its generality and flexibility.
3. **Experimental Validation**: The paper conducts experiments on the Tactile Gym 2 simulator, showing that M2CURL significantly improves learning efficiency and performance in various manipulation tasks, evidenced by faster convergence speeds and higher cumulative rewards.

### Experimental Results

1. **Sample Efficiency**: At 100k and 500k environment steps, M2CURL demonstrates higher sample efficiency compared to baseline algorithms, especially in the early stages of learning.
2. **Performance Improvement**: M2CURL achieves better performance in multiple tasks (such as object pushing, edge following, and surface following), particularly excelling in complex tasks.
3. **Importance of Modality Fusion**: The experimental results also show that the fusion of multimodal data (combining visual and tactile data) significantly outperforms unimodal methods.

In summary, by proposing the M2CURL method, the paper effectively addresses the issues of representation learning and sample efficiency in multimodal reinforcement learning, providing a new solution for robotic manipulation tasks.
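
To make the idea of combining intra-modal and inter-modal contrastive losses more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes an InfoNCE-style objective with cosine similarity and a fixed temperature, assumes that each modality has a query embedding and a key embedding per sample, and assumes that the four loss terms are simply summed. The function names `info_nce` and `multimodal_contrastive_loss`, the `temperature` parameter, and the equal weighting are all hypothetical choices made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (leaves an all-zero vector unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE-style loss: query i should match key i and no other key.

    Assumption: cosine similarity scaled by a temperature. The summary
    above does not specify the similarity function used in the paper.
    """
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the positive (matching) key.
        total += log_denominator - logits[i]
    return total / len(q)


def multimodal_contrastive_loss(vis_q, vis_k, tac_q, tac_k, temperature=0.1):
    """Sum of intra-modal and inter-modal terms (equal weights are an assumption)."""
    # Intra-modal: each modality's query is matched against its own keys.
    intra = info_nce(vis_q, vis_k, temperature) + info_nce(tac_q, tac_k, temperature)
    # Inter-modal: each modality's query is matched against the other modality's keys.
    inter = info_nce(vis_q, tac_k, temperature) + info_nce(tac_q, vis_k, temperature)
    return intra + inter


# Toy batch of two samples with 2-D embeddings.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = multimodal_contrastive_loss(aligned, aligned, aligned, aligned)
loss_swapped = multimodal_contrastive_loss(aligned, swapped, aligned, swapped)
```

In this toy batch, `loss_aligned` is close to zero because every query matches its own key, while `loss_swapped` is much larger because the positives point in the wrong direction. In an algorithm-agnostic setup like the one described above, a loss of this kind would be minimized alongside the usual RL objective so that the encoders feeding the policy learn from both signals.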