Abstract: Enabling robots to learn manipulation tasks by observing human demonstrations remains a major challenge. Recent advances in video captioning provide an end-to-end way to translate demonstration videos into robotic commands. Compared with general video captioning, the Video to Command (V2C) task faces two key challenges: (1) how to extract key frames containing fine-grained manipulation actions from demonstration videos that contain a large amount of redundant information; and (2) how to improve the accuracy of the generated commands enough for V2C methods to be applied to real robot tasks. To address these problems, we propose a multi-modal framework for robots to learn manipulation tasks from human demonstrations. The framework consists of five components: a Text Encoder, a Video Encoder, an Action Classifier, a Keyframe Aligner, and a Command Decoder. Our work within this framework has two parts: (1) we extract key frame information from the video and analyze its effect on the translation accuracy of robot commands; (2) using both video and caption text, we explore the effect of multi-modal information fusion on the accuracy of the commands generated by the model. Experiments show that our model significantly outperforms existing methods on the standard video captioning metrics BLEU_N, METEOR, ROUGE_L, and CIDEr. In particular, the variant CGM-V, which uses only video information, improves BLEU_4 by 0.8%, and the variant CGM-M, which uses multi-modal information, improves BLEU_4 by 43.7%. Furthermore, when combined with an affordance detection network and a motion planner, our framework enables the robot to reproduce the demonstrated tasks. Our source code and expanded annotations for the IIT-V2C dataset are available at https://github.com/yin0816/CGM-M.

Composable Instructions and Prospection Guided Visuomotor Control for Robotic Manipulation

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Vision-Based Categorical Object Pose Estimation and Manipulation

Whole-Body Inverse Kinematics and Operation-Oriented Motion Planning for Robot Mobile Manipulation

Programmatically Grounded, Compositionally Generalizable Robotic Manipulation

A Multi-modal Framework for Robots to Learn Manipulation Tasks from Human Demonstrations

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation

Natural Language Instruction Understanding for Robotic Manipulation: a Multisensory Perception Approach

Multifingered Robot Hand Compliant Manipulation Based on Vision-Based Demonstration and Adaptive Force Control

Learning Robotic Manipulation from Demonstrations by Combining Deep Generative Model and Dynamic Control System

Integrating Historical Learning and Multi-View Attention with Hierarchical Feature Fusion for Robotic Manipulation

Human-oriented Representation Learning for Robotic Manipulation

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Learning to Imagine Manipulation Goals for Robot Task Planning

Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-To-End Learning from Demonstration

Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

Research on Task Decomposition and Motion Trajectory Optimization of Robotic Arm Based on VLA Large Model

Vision-based reinforcement learning control of soft robot manipulators

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models