CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

Gi-Cheon Kang, Junghyun Kim, Kyuhwan Shim, Jun Ki Lee, Byoung-Tak Zhang
2024-11-01
Abstract: This paper explores how non-experts can teach robots desired skills in their environments. We argue that natural language is an intuitive and accessible interface for robot learning. To this end, we investigate two key aspects: (1) how non-experts collect robotic data using natural language supervision and (2) how pre-trained vision-language models learn end-to-end policies directly from this supervision. We propose a data collection framework that collects robot demonstrations based on natural language supervision (e.g., "move forward") and further augments these demonstrations. Next, we introduce a model that learns language-conditioned policies from natural language supervision called CLIP-RT. Our model employs pre-trained CLIP models and learns to predict actions represented in language via contrastive imitation learning. We first train CLIP-RT on large-scale robotic data and then enable it to learn desired skills using data collected from our framework. CLIP-RT shows strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 17% in average success rates, while using 7x fewer parameters (1B).
Robotics
### What problem does this paper attempt to solve?

This paper addresses how non-expert users can teach robots the skills they need through natural-language supervision. It focuses on two key aspects:

1. **How non-expert users can collect robot data through natural-language supervision**: Traditional robot data collection usually requires experts to operate the robot or relies on complex teleoperation systems, which makes it difficult for non-experts to participate. The paper proposes a framework that lets non-expert users collect robot demonstrations through natural-language instructions (such as "move forward").
2. **How pre-trained vision-language models can learn end-to-end policies from natural-language supervision**: The authors propose a new model, CLIP-RT, that learns language-conditioned robot policies directly from natural-language supervision. CLIP-RT is trained with contrastive imitation learning, predicting actions with natural language as the supervision signal.

### Main contributions

1. **Proposing the CLIP-RT model**: A CLIP-based vision-language-action (VLA) model that learns language-conditioned robot policies from natural-language supervision.
2. **Proposing a data-collection framework**: The framework lets non-expert users collect robot data using only natural language, and expands that data through automatic data augmentation (such as stochastic trajectory diversification, STD).
3. **Experimentally verifying the effectiveness of CLIP-RT**: On 10 new manipulation tasks, CLIP-RT outperforms the existing state-of-the-art model OpenVLA, improving the average success rate by 17% while using only one-seventh as many parameters (1B vs 7B).
4. **Ablation studies demonstrating the importance of key components**: The ablations show the advantage of pre-trained vision-language models (such as CLIP) under natural-language supervision, and the effectiveness of stochastic trajectory diversification (STD) when data is scarce.

### Formula summary

The main formulas in the paper are:

- **Language-conditioned imitation learning loss**:
  \[ L_{\text{il}} = -\mathbb{E}_{(v, \ell, a)\sim D}\left[\log \pi_\theta(a \mid v, \ell)\right] \]
  where $\pi_\theta$ is the policy with parameters $\theta$, and $(v, \ell, a)$ is a triple of visual observation, language instruction, and action drawn from the demonstration dataset $D$.
- **Contrastive learning loss**:
  \[ L_{\text{cl}} = -\frac{1}{2M}\sum_{i=1}^{M}\sum_{j=1}^{M}\left[y_{ij}\log\phi_I(s_{ij}) + y_{ij}\log\phi_T(s_{ij})\right] \]
  where $M$ is the number of pairs in a batch, $y_{ij}$ indicates whether image $I_i$ and text $T_j$ form a positive pair, $s_{ij}$ is the cosine similarity between their embeddings, and $\phi_I(s_{ij})$ and $\phi_T(s_{ij})$ are the image-to-text and text-to-image similarity scores, respectively (softmax-normalized, as in CLIP).
- **Contrastive imitation learning loss**:
  \[ L_{\text{cil}} = -\frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1}^{M}\left[y_{ij}\log\sigma(s_{ij}) + (1 - y_{ij})\log\left(1 - \sigma(s_{ij})\right)\right] \]
  where $\sigma(s_{ij}) = \frac{1}{1+\exp(-\text{sim}(c_i, z_j))}$, and $c_i$ and $z_j$ are the context embedding and the language-supervision embedding, respectively.

With these objectives, CLIP-RT learns effectively from natural-language supervision and performs well on new tasks.
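The difference between the contrastive learning loss and the contrastive imitation learning loss is easier to see in code. The following is a minimal NumPy sketch, not the authors' implementation. It assumes that `sim` is cosine similarity, that $\phi_I$ and $\phi_T$ are row-wise and column-wise softmaxes as in CLIP, and that no temperature or learned scale is applied. All function names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(c, z):
    """Pairwise cosine similarity between rows of c (M, d) and rows of z (M, d)."""
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return c @ z.T


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def log_softmax(x, axis):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(s, y):
    """L_cl: softmax over each row (image-to-text) and each column (text-to-image).

    Assumption: phi_I and phi_T are softmaxes without a temperature term.
    """
    m = s.shape[0]
    image_to_text = log_softmax(s, axis=1)
    text_to_image = log_softmax(s, axis=0)
    return -(y * image_to_text + y * text_to_image).sum() / (2 * m)


def contrastive_imitation_loss(c, z, y):
    """L_cil: an independent binary cross-entropy term for every (i, j) pair."""
    s = cosine_similarity_matrix(c, z)
    # log(1 - sigmoid(s)) equals log(sigmoid(-s)).
    per_pair = y * log_sigmoid(s) + (1.0 - y) * log_sigmoid(-s)
    # The mean over the M x M matrix is the 1 / M^2 double sum.
    return -per_pair.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    context = rng.normal(size=(4, 8))      # stand-in for context embeddings c_i
    supervision = rng.normal(size=(4, 8))  # stand-in for language embeddings z_j
    labels = np.eye(4)                     # pair (i, i) is the only positive here
    print(contrastive_imitation_loss(context, supervision, labels))
```

Because each pair is scored by its own sigmoid rather than by a softmax over the batch, the label matrix `y` may contain more than one positive entry per row, which the softmax form handles less naturally.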