Abstract: Enabling robots to learn manipulation tasks by observing human demonstrations remains a major challenge. Recent advances in video captioning provide an end-to-end way to translate demonstration videos into robotic commands. Compared with general video captioning, the Video to Command (V2C) task faces two key challenges: (1) how to extract key frames containing fine-grained manipulation actions from demonstration videos that contain a large amount of redundant information; and (2) how to improve the accuracy of the generated commands enough for the V2C method to be applied to real robot tasks. To address these problems, we propose a multi-modal framework for robots to learn manipulation tasks from human demonstrations. The framework consists of five components: Text Encoder, Video Encoder, Action Classifier, Keyframe Aligner and Command Decoder. Our work has two main parts: (1) we extract key-frame information from the video and analyze its effect on the translation accuracy of robot commands; (2) using both the video and the caption text, we explore the effect of multi-modal information fusion on the accuracy of the commands generated by the model. Experiments show that our model significantly outperforms existing methods on the standard video captioning metrics BLEU_N, METEOR, ROUGE_L, and CIDEr. In particular, the variant CGM-V, which uses only video information, improves BLEU_4 by 0.8%, and the variant CGM-M, which uses multi-modal information, improves BLEU_4 by 43.7%. Furthermore, when combined with an affordance detection network and a motion planner, our framework enables the robot to reproduce the tasks in the demonstration. Our source code and expanded annotations for the IIT-V2C dataset are at https://github.com/yin0816/CGM-M.

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Understanding Action Sequences based on Video Captioning for Learning-from-Observation

Enhancing Robot Manipulation Skill Learning with Multi-task Capability Based on Transformer and Token Reduction

Learning Multi-Step Manipulation Tasks from A Single Human Demonstration

A Multi-modal Framework for Robots to Learn Manipulation Tasks from Human Demonstrations

Audio-visual scene understanding utilizing text information for a cooking support robot

Learning Actions from Human Demonstration Video for Robotic Manipulation

An Object Attribute Guided Framework for Robot Learning Manipulations from Human Demonstration Videos

CineTransfer: Controlling a Robot to Imitate Cinematographic Style from a Single Example

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Robobarista: Learning to Manipulate Novel Objects via Deep Multimodal Embedding

Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation

Robot Cooking with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects

Transformers for One-Shot Visual Imitation

This&That: Language-Gesture Controlled Video Generation for Robot Planning

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Exploiting Information Theory for Intuitive Robot Programming of Manual Activities

VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model