Enhancing Robot Manipulation Skill Learning with Multi-task Capability Based on Transformer and Token Reduction

Ren‐Jie Han, Naijun Liu, Chang Liu, Tianyu Gou, Fuchun Sun
2023-01-01
Abstract: Learning skills from videos based on language instructions is an innate ability for humans. Robots, however, face significant challenges in connecting language instructions to visual observations, exercising precise control, and retaining memory of previous actions. Multi-task learning is also difficult because tasks can interfere with one another. To address these issues, we propose CSATO, a Transformer-based algorithm that learns robot skills from videos and performs multi-task learning conditioned on given instructions. The algorithm consists of a visual-language fusion network, a token reduction network, and a Transformer decoder network, and is trained by predicting the action for each corresponding state. The fusion network integrates information from the different modalities, while the token reduction network uses channel and spatial attention mechanisms to reduce the number of tokens passed to the Transformer. The Transformer then models the relationships among the instructions and the current and historical visual observations, and generates action predictions autoregressively. To improve multi-task learning performance, learnable offset parameters are introduced for each task in the final action prediction stage. We demonstrate the effectiveness of the approach on several continuous manipulation tasks.
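The per-task offsets mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify how the offsets are parameterised or trained, so everything below is an assumption made for illustration: the class name `TaskOffsetHead`, the additive form, the zero initialisation, and the plain squared-error gradient step are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: a shared action prediction plus one learnable
# offset vector per task. All names and the update rule are assumptions.

class TaskOffsetHead:
    def __init__(self, num_tasks, action_dim):
        # One offset vector per task, initialised to zero so that every
        # task starts from the shared prediction.
        self.offsets = [[0.0] * action_dim for _ in range(num_tasks)]

    def predict(self, shared_action, task_id):
        # Final action = shared prediction + offset of the active task.
        return [a + o for a, o in zip(shared_action, self.offsets[task_id])]

    def update(self, shared_action, target_action, task_id, lr=0.1):
        # One gradient step on 0.5 * (a + o - t)^2 with respect to the
        # offset only; the gradient for each dimension is (a + o - t).
        pred = self.predict(shared_action, task_id)
        offset = self.offsets[task_id]
        for i, (p, t) in enumerate(zip(pred, target_action)):
            offset[i] -= lr * (p - t)


head = TaskOffsetHead(num_tasks=2, action_dim=3)
shared = [0.5, -0.2, 0.1]  # stand-in for the decoder's shared output

# Training on task 0 changes only task 0's offset.
head.update(shared, target_action=[0.6, -0.2, 0.0], task_id=0)
```

In this sketch an update for one task leaves the other tasks' offsets untouched, which is one plausible way such parameters could limit interference between tasks while the rest of the network stays shared.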