Abstract: This paper explores how non-experts can teach robots desired skills in their environments. We argue that natural language is an intuitive and accessible interface for robot learning. To this end, we investigate two key aspects: (1) how non-experts collect robotic data using natural language supervision and (2) how pre-trained vision-language models learn end-to-end policies directly from this supervision. We propose a data collection framework that collects robot demonstrations based on natural language supervision (e.g., "move forward") and further augments these demonstrations. Next, we introduce a model that learns language-conditioned policies from natural language supervision called CLIP-RT. Our model employs pre-trained CLIP models and learns to predict actions represented in language via contrastive imitation learning. We first train CLIP-RT on large-scale robotic data and then enable it to learn desired skills using data collected from our framework. CLIP-RT shows strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 17% in average success rates, while using 7x fewer parameters (1B).


### What problem does this paper attempt to solve?

This paper addresses how non-expert users can teach robots the skills they need through natural-language supervision. It focuses on two key aspects:

1. **How non-expert users can collect robot data through natural-language supervision**: Traditional robot data collection usually requires experts to operate the robot or to use complex teleoperation systems, which makes it difficult for non-experts to participate. The paper proposes a framework that lets non-expert users collect robot demonstrations through natural-language instructions (such as "move forward").
2. **How pre-trained vision-language models can learn end-to-end policies from natural-language supervision**: The authors propose a new model, CLIP-RT, that learns language-conditioned robot policies directly from natural-language supervision. CLIP-RT uses contrastive imitation learning to predict actions, with natural language as the supervision signal.

### Main contributions

1. **The CLIP-RT model**: A CLIP-based vision-language-action (VLA) model that learns language-conditioned robot policies from natural-language supervision.
2. **A data-collection framework**: The framework lets non-expert users collect robot data using natural language alone and expands the collected data through automatic augmentation, namely stochastic trajectory diversification (STD).
3. **Experimental validation of CLIP-RT**: On 10 new manipulation tasks, CLIP-RT outperforms the existing state-of-the-art model OpenVLA by 17% in average success rate while using one-seventh of its parameters (1B vs. 7B).
4. **Ablation studies on the key components**: The ablations show the advantage of pre-trained vision-language models (such as CLIP) under natural-language supervision and the effectiveness of stochastic trajectory diversification (STD) when data is scarce.

### Formula summary

The main formulas in the paper are:

- **Language-conditioned imitation learning loss**:
  \[ \mathcal{L}_{\text{il}} = -\mathbb{E}_{(v, \ell, a)\sim D}\left[\log \pi_\theta(a \mid v, \ell)\right] \]
  where $\pi_\theta$ is the policy with parameters $\theta$, and $(v, \ell, a)$ is an (observation, instruction, action) tuple drawn from the dataset $D$.
- **Contrastive learning loss**:
  \[ \mathcal{L}_{\text{cl}} = -\frac{1}{2M}\sum_{i=1}^{M}\sum_{j=1}^{M}\left[y_{ij}\log\phi_I(s_{ij}) + y_{ij}\log\phi_T(s_{ij})\right] \]
  where $M$ is the number of image-text pairs, $y_{ij}$ is the label indicating whether image $I_i$ and text $T_j$ match, $s_{ij}$ is the cosine similarity between their embeddings, and $\phi_I(s_{ij})$ and $\phi_T(s_{ij})$ are the image-to-text and text-to-image similarity scores, respectively.
- **Contrastive imitation learning loss**:
  \[ \mathcal{L}_{\text{cil}} = -\frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1}^{M}\left[y_{ij}\log\sigma(s_{ij}) + (1 - y_{ij})\log\left(1 - \sigma(s_{ij})\right)\right] \]
  where $\sigma(s_{ij}) = \frac{1}{1+\exp(-\text{sim}(c_i, z_j))}$, and $c_i$ and $z_j$ are the context embedding and the language-supervision embedding, respectively.

With these methods, CLIP-RT learns effectively from natural-language supervision and performs well on new tasks.
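To make the contrastive imitation learning loss above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the context embeddings $c_i$ and language-supervision embeddings $z_j$ are already computed (in CLIP-RT they would come from CLIP encoders, which are omitted here). The names `contrastive_imitation_loss` and `select_action` and the `scale` parameter are hypothetical and introduced only for illustration. With `scale=1.0` the loss matches the formula as written. The `select_action` helper shows one plausible way to use the same similarity at inference time, by picking the candidate language action closest to the context; that step is an assumption rather than something stated in the summary.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)


def contrastive_imitation_loss(context_emb, language_emb, labels, scale=1.0):
    """Pairwise binary cross-entropy over an M x M similarity matrix.

    context_emb:  (M, d) array of context embeddings c_i
    language_emb: (M, d) array of language-supervision embeddings z_j
    labels:       (M, M) array with y_ij = 1 for matching pairs, else 0
    scale:        hypothetical multiplier on the cosine similarity
                  (1.0 reproduces the formula as written)
    """
    c = l2_normalize(context_emb)
    z = l2_normalize(language_emb)
    s = scale * (c @ z.T)  # s_ij = scale * cos(c_i, z_j)
    positive_term = labels * log_sigmoid(s)
    # log(1 - sigmoid(s)) is the same as log(sigmoid(-s))
    negative_term = (1.0 - labels) * log_sigmoid(-s)
    # The mean over all M * M entries equals the 1 / M^2 double sum
    return -(positive_term + negative_term).mean()


def select_action(context_emb, candidate_emb):
    """Index of the candidate whose embedding is most similar to the context.

    Assumed inference rule: argmax of cosine similarity over candidates.
    """
    c = l2_normalize(context_emb)
    z = l2_normalize(candidate_emb)
    return int(np.argmax(z @ c))


# Toy usage with random vectors standing in for encoder outputs
rng = np.random.default_rng(0)
M, d = 4, 8
embeddings = rng.normal(size=(M, d))
labels = np.eye(M)  # the i-th context matches the i-th language action

loss_aligned = contrastive_imitation_loss(embeddings, embeddings, labels, scale=10.0)
loss_shuffled = contrastive_imitation_loss(embeddings, embeddings[::-1], labels, scale=10.0)
chosen = select_action(embeddings[2], embeddings)
```

In this toy setup the matched pairing yields a lower loss than the reversed pairing. Unlike the softmax-based contrastive loss, the sigmoid form scores each pair independently, so several language actions can be labeled as positives for the same context.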

CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications

Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

Spatial-Language Attention Policies for Efficient Robot Learning

RT-H: Action Hierarchies Using Language

Language-Conditioned Imitation Learning for Robot Manipulation Tasks

Learning Visual Robotic Control Efficiently with Contrastive Pre-training and Data Augmentation

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

CLFR-M: Continual Learning Framework for Robots Via Human Feedback and Dynamic Memory

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition

Learning Latent Plans from Play

CLIP-Motion: Learning Reward Functions for Robotic Actions Using Consecutive Observations

Grounding Robot Policies with Visuomotor Language Guidance

Contrastive Language, Action, and State Pre-training for Robot Learning

Grounding Language with Visual Affordances over Unstructured Data

Natural Language Can Help Bridge the Sim2Real Gap

Interactive Robot Learning from Verbal Correction

Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

STEER: Flexible Robotic Manipulation via Dense Language Grounding