Masked Visual-Tactile Pre-training for Robot Manipulation

Liu Qingtao, Ye Qi, Sun Zhengnan, Cui Yu, Li Gaofeng, Chen Jiming
DOI: https://doi.org/10.1109/icra57147.2024.10610933
2024-01-01
Abstract: Recent work on pre-training for robot manipulation has demonstrated that representations learned from large human manipulation datasets can generalize well to new manipulation tasks and environments. However, these approaches mainly focus on human vision or natural language and neglect tactile feedback. In this article, we explore how to pre-train a representation model for robotic manipulation using both visual and tactile data from human manipulation. We develop a system for collecting visual and tactile data, featuring a cost-effective tactile glove to capture human tactile data and a HoloLens 2 to capture visual data. With this system, we collect a dataset of turning bottle caps. Furthermore, we introduce a novel visual-tactile fusion network and learning strategy with two key modules: one tokenizes 20 sparse binary tactile signals that sense touch states, so that tactile context can be learned, and the other applies attention and masking mechanisms to the interaction of visual and tactile tokens for visual-tactile representation learning. We use our dataset to pre-train the fusion model and embed the pre-trained model into a reinforcement learning framework for downstream tasks. Experimental results demonstrate that our pre-trained model significantly aids in learning manipulation skills. Compared to methods without pre-training, our approach increases the success rate by over 60%. Additionally, compared to current visual pre-training methods, our success rate exceeds theirs by more than 50%.
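The abstract names two mechanisms without giving details: tokenizing 20 binary tactile signals, and applying masking and attention across visual and tactile tokens. The following is a minimal sketch of how such a pipeline could look, not the authors' network. Only the count of 20 binary signals comes from the abstract. The embedding width, the per-sensor position embedding, the 50% mask ratio, the 16 stand-in visual tokens, and the single-head attention with identity projections are all assumptions made for illustration.

```python
import numpy as np

NUM_TACTILE = 20   # from the abstract: 20 sparse binary tactile signals
EMBED_DIM = 32     # assumption: embedding width is not given in the abstract


def tokenize_tactile(signals, state_embed, pos_embed):
    """Turn each binary touch state into one token.

    Assumed scheme: token = embedding of the state (0 or 1)
    plus an embedding that identifies the sensor position.
    """
    signals = np.asarray(signals, dtype=np.int64)
    assert signals.shape == (NUM_TACTILE,)
    return state_embed[signals] + pos_embed  # shape (20, EMBED_DIM)


def random_mask(tokens, mask_ratio, rng):
    """Drop a random subset of tokens.

    Returns the visible tokens and a boolean mask (True = masked).
    """
    n = tokens.shape[0]
    n_keep = n - int(round(n * mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    Identity projections are used for brevity (an assumption);
    a real model would learn query, key, and value projections.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)
state_embed = rng.normal(size=(2, EMBED_DIM))          # one row per touch state
pos_embed = rng.normal(size=(NUM_TACTILE, EMBED_DIM))  # one row per sensor
visual_tokens = rng.normal(size=(16, EMBED_DIM))       # stand-in for image patch embeddings

touch = np.zeros(NUM_TACTILE, dtype=np.int64)
touch[[0, 3, 7]] = 1  # hypothetical reading: three sensors in contact

tactile_tokens = tokenize_tactile(touch, state_embed, pos_embed)
tokens = np.concatenate([visual_tokens, tactile_tokens], axis=0)
visible, mask = random_mask(tokens, mask_ratio=0.5, rng=rng)
fused = self_attention(visible)  # visual and tactile tokens attend to each other
```

In a masked pre-training setup of this kind, a decoder would typically be trained to reconstruct the masked tokens from `fused`; that step is omitted here because the abstract does not describe the training objective.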