Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications

Nghia Nguyen, Minh Nhat Vu, Tung D. Ta, Baoru Huang, Thieu Vo, Ngan Le, Anh Nguyen

2024-09-26

Abstract: Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (~7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP's strong image performance while gaining the ability to understand actions in robotic contexts. Intensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.

Robotics, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the limited ability of current vision-language models (VLMs), such as CLIP, to understand dynamic actions in robotic tasks. Specifically:

1. **Problem Background**: Existing VLMs such as CLIP are primarily trained on datasets that pair static images with text. As a result, they handle static images well but perform poorly on temporal data such as videos or action sequences.
2. **Research Motivation**: For robots to better understand and execute language-based tasks, they need a method that effectively captures action information. However, CLIP and its existing variants mainly focus on static data and lack the ability to understand actions.
3. **Solution**: The authors propose Robotic-CLIP, a model that enhances CLIP's ability to understand actions by fine-tuning it on a large-scale action dataset. The model retains CLIP's strengths in image processing while also understanding action descriptions in a robotic context.
4. **Main Contributions**:
   - Introduced the Robotic-CLIP model, specifically designed for language-based robotic tasks.
   - Proposed a method to generate a large-scale action dataset and developed a new fine-tuning technique that enables the model to understand actions.
   - Conducted extensive experimental validation on various robotic tasks, demonstrating the model's effectiveness.

With these improvements, Robotic-CLIP can outperform other CLIP variants in various language-driven robotic tasks, particularly grasp detection, policy learning, and robotic navigation.
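The abstract states only that Robotic-CLIP is fine-tuned "using contrastive learning", so the exact objective is not given here. As a rough illustration of what contrastive fine-tuning usually means for CLIP-style models, the sketch below computes the standard symmetric contrastive loss over a batch of paired frame and text embeddings. The function names, the toy embeddings, and the temperature of 0.07 are illustrative assumptions, not details taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(frame_embs, text_embs, temperature=0.07):
    """Symmetric CLIP-style contrastive loss (hypothetical helper, not the authors' code).

    frame_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    frames = [l2_normalize(v) for v in frame_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(frames)

    # logits[i][j] = scaled cosine similarity between frame i and text j
    logits = [
        [sum(a * b for a, b in zip(f, t)) / temperature for t in texts]
        for f in frames
    ]

    # Frame-to-text direction: each row should peak on its diagonal entry.
    loss_f2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-frame direction: same idea applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2f = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_f2t + loss_t2f)

# Toy 2-D embeddings (made up for illustration).
frames = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # each text matches its frame
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # each text matches the wrong frame

matched_loss = clip_contrastive_loss(frames, aligned_texts)
mismatched_loss = clip_contrastive_loss(frames, swapped_texts)
```

With these toy inputs, the loss is close to zero when each frame is paired with its own description and large when the pairs are swapped, which is the signal that pulls matching frame and text embeddings together during fine-tuning.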

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

CLIP feature-based randomized control using images and text for multiple tasks and robots

ActionCLIP: A New Paradigm for Video Action Recognition

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

How Much Can CLIP Benefit Vision-and-Language Tasks?

Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

CLIP-Motion: Learning Reward Functions for Robotic Actions Using Consecutive Observations

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Improving CLIP Training with Language Rewrites

Improving CLIP Robustness with Knowledge Distillation and Self-Training

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Contrastive Language, Action, and State Pre-training for Robot Learning

CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

EPK-CLIP: External and Priori Knowledge CLIP for action recognition

Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints