Abstract: Language-conditioned robot manipulation is an emerging field aimed at enabling seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robotic actions. In this comprehensive survey, we systematically explore recent advancements in language-conditioned robotic manipulation. We categorize existing methods into language-conditioned reward shaping, language-conditioned policy learning, neuro-symbolic artificial intelligence, and the utilization of foundational models (FMs) such as large language models (LLMs) and vision-language models (VLMs). Specifically, we analyze state-of-the-art techniques concerning semantic information extraction, environment and evaluation, auxiliary tasks, and task representation strategies. By conducting a comparative analysis, we highlight the strengths and limitations of current approaches in bridging language instructions with robot actions. Finally, we discuss open challenges and future research directions, focusing on potentially enhancing generalization capabilities and addressing safety issues in language-conditioned robot manipulators. The GitHub repository of this paper can be found at https://github.com/hk-zh/language-conditioned-robot-manipulation-models.

What problem does this paper attempt to address?

The problem this paper addresses is: how can robots be made to understand and execute instructions given in natural language, so that humans and robots can communicate and collaborate seamlessly? The research focuses on three aspects:

1. **Semantic Extraction**: How can robots effectively extract semantic information from natural-language commands?
   - The technologies involved include pre-trained language representations (such as BERT, RoBERTa, and GloVe) and large language models (LLMs).
2. **Scene Understanding**: How can robots focus on the relevant parts of a scene according to language commands?
   - Advances in computer vision, such as convolutional neural networks (CNNs), multi-object detection networks (Faster R-CNN, YOLO), Transformer-based models (DETR, ViT), and vision-language models (VLMs) such as CLIP and Flamingo, help robots link visual entities with language commands.
3. **Action Execution**: How can robots transform high-level image-language understanding into precise low-level mechanical actions?
   - Reinforcement learning (RL) and imitation learning (IL) are the commonly used paradigms. In addition, researchers design low-level actions by hand, using traditional path-planning and motion-planning algorithms combined with inverse kinematics (IK) to compute joint configurations.

### Solution Overview

To address these problems, the paper surveys four families of methods:

1. **Language-conditioned Reward Shaping**:
   - **Sparse Reward**: Simplifies reward design, but may lead to low sample efficiency.
   - **Dense Reward**: Improves sample efficiency by giving rewards as the task progresses, but makes reward design more complex.
   - **Reward Function Learning**: Learns the reward function from expert demonstrations, for example by inferring it with inverse reinforcement learning (IRL).
2. **Language-conditioned Policy Learning**:
   - **Reinforcement Learning**: Learns a policy from language-conditioned rewards; suitable for games and robot manipulation tasks.
   - **Behavior Cloning**: Trains robots to imitate expert examples by minimizing the difference between expert actions and the actions the agent predicts.
   - **Diffusion Policy Learning**: Applies diffusion models from generative AI to improve imitation learning.
3. **Neuro-symbolic AI**:
   - Combines symbolic reasoning and deep learning to enhance the reasoning and generalization abilities of robots.
4. **Application of Foundation Models (FMs)**:
   - Uses large-scale pre-trained foundation models (such as LLMs and VLMs) to improve language understanding and task execution.

### Future Research Directions

The paper also discusses future challenges and research directions, mainly:

- Improving generalization so that robots can perform tasks in more diverse environments.
- Addressing safety problems to ensure that language-conditioned robot manipulation is safe and reliable.
- Exploring new model architectures, such as vision-language-action models (VLAs), to better integrate the understanding of vision, language, and action.

By surveying these methods and technologies, the paper aims to advance the field of language-conditioned robot manipulation and make such systems more capable and practical.
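Two of the ideas summarized above, the difference between sparse and dense rewards and the behavior-cloning objective, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the paper or its repository: the instruction-to-goal table `GOALS`, the 2-D positions, the tolerance `tol`, and all function names are hypothetical, and the behavior-cloning loss is taken to be a plain mean squared error.

```python
import math

# Hypothetical lookup from an instruction to a 2-D goal position (an assumption
# for illustration; a real system would ground the language with a learned model).
GOALS = {"move the block to the bin": (0.5, 0.2)}


def sparse_reward(pos, instruction, tol=0.05):
    """Sparse reward: 1.0 only once the instructed goal is reached, else 0.0."""
    return 1.0 if math.dist(pos, GOALS[instruction]) <= tol else 0.0


def dense_reward(pos, instruction):
    """Dense reward: negative distance to the goal, so every step gives feedback."""
    return -math.dist(pos, GOALS[instruction])


def behavior_cloning_loss(expert_actions, predicted_actions):
    """Mean squared error between expert actions and agent-predicted actions."""
    total, count = 0.0, 0
    for expert, predicted in zip(expert_actions, predicted_actions):
        for e, p in zip(expert, predicted):
            total += (e - p) ** 2
            count += 1
    return total / count


# A made-up end-effector trajectory that ends at the goal.
instruction = "move the block to the bin"
trajectory = [(0.0, 0.0), (0.3, 0.1), (0.5, 0.2)]

sparse = [sparse_reward(p, instruction) for p in trajectory]
dense = [dense_reward(p, instruction) for p in trajectory]
bc_loss = behavior_cloning_loss([(1.0, 0.0)], [(0.5, 0.0)])
```

On this trajectory the sparse signal is zero until the final step, while the dense signal rises at every step toward the goal. That is the sample-efficiency trade-off described above: the dense reward is more informative but requires a hand-chosen progress measure.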

Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

Grounding Language for Robotic Manipulation via Skill Library

EnvBridge: Bridging Diverse Environments with Cross-Environment Knowledge Transfer for Embodied AI

Language-Conditioned Imitation Learning for Robot Manipulation Tasks

Multi-modal Interaction with Transformers: Bridging Robots and Human with Natural Language

Open-World Object Manipulation using Pre-trained Vision-Language Models

A Survey of Language-Based Communication in Robotics

Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning

"No, to the Right" -- Online Language Corrections for Robotic Manipulation via Shared Autonomy

NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation

Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data

Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

MHRC: Closed-loop Decentralized Multi-Heterogeneous Robot Collaboration with Large Language Models

Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Language to Rewards for Robotic Skill Synthesis

Spatial-Language Attention Policies for Efficient Robot Learning

Non-Prehensile Tool-Object Manipulation by Integrating LLM-Based Planning and Manoeuvrability-Driven Controls

Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach