Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications

Nghia Nguyen, Minh Nhat Vu, Tung D. Ta, Baoru Huang, Thieu Vo, Ngan Le, Anh Nguyen
2024-09-26
Abstract: Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (~7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP's strong image performance while gaining the ability to understand actions in robotic contexts. Intensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limited ability of current vision language models (VLMs), such as CLIP, to understand dynamic actions in robotic tasks. Specifically:

1. **Problem Background**: Existing VLMs, such as CLIP, are trained primarily on datasets that pair static images with text. As a result, they perform well on static images but poorly on temporal data such as videos or action sequences.
2. **Research Motivation**: For robots to better understand and execute language-based tasks, a method that effectively captures action information is needed. However, CLIP and its existing variants focus mainly on static data and lack the ability to understand actions.
3. **Solution**: The authors propose Robotic-CLIP, a model that strengthens CLIP's understanding of actions by fine-tuning it on a large-scale action dataset. The model retains CLIP's strengths in image processing while also understanding action descriptions in a robotic context.
4. **Main Contributions**:
   - Introduced the Robotic-CLIP model, designed specifically for language-based robotic tasks.
   - Proposed a method for generating a large-scale action dataset and developed a new fine-tuning technique that enables the model to understand actions.
   - Conducted extensive experiments on various robotic tasks, demonstrating the model's effectiveness.

With these improvements, Robotic-CLIP outperforms other CLIP variants across various language-driven robotic tasks, particularly in grasp detection, policy learning, and robotic navigation.
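To make the "fine-tuning with contrastive learning" step more concrete, the sketch below shows the standard CLIP-style symmetric contrastive loss applied to a batch of paired frame and text embeddings. It is an illustration under assumptions, not the authors' objective: the summary does not specify the exact loss used for Robotic-CLIP, and the function names, the toy embeddings, and the `temperature=0.07` default are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def symmetric_contrastive_loss(frame_embs, text_embs, temperature=0.07):
    """CLIP-style loss: the i-th frame and the i-th text form the positive pair.

    Assumption: this is the generic symmetric InfoNCE formulation, used here
    only as an illustration; it is not taken from the Robotic-CLIP paper.
    """
    frames = [l2_normalize(v) for v in frame_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(frames)

    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(f, t)) / temperature for t in texts]
        for f in frames
    ]

    # Frame-to-text direction: each row should peak on its diagonal entry.
    loss_f2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-frame direction: each column should peak on its diagonal entry.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2f = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_f2t + loss_t2f)


# Toy 2-D embeddings (hypothetical): matched pairs vs. swapped captions.
frames = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = symmetric_contrastive_loss(frames, matched_texts)
swapped_loss = symmetric_contrastive_loss(frames, swapped_texts)
```

In this toy batch, `matched_loss` is close to zero while `swapped_loss` is large, which is the behavior a contrastive objective rewards: embeddings of a frame and its own action description are pulled together, and mismatched pairs are pushed apart.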