Abstract: Visual imitation learning is a promising approach that enables robots to learn skills from visual demonstrations. However, current visual imitation learning approaches assume that the contexts of the visual demonstrations and the robot observations are consistent, an unrealistic assumption that limits their flexibility and scalability. Learning from visual demonstrations with inconsistent contexts therefore remains a key challenge for robots. Inconsistent contexts can cause large differences in the pixel distributions of the operator and the environment, which renders vision-based control policies largely ineffective. In this paper, we propose a novel imitation learning framework that enables robots to reproduce behavior by watching human demonstrations with inconsistent contexts, such as different viewpoints, operators, backgrounds, object appearances and positions. Specifically, our framework consists of three networks: a flow-based viewpoint transformation network (FVTrans), a robot2human alignment network (RANet) and an inverse dynamics network (IDNet). First, FVTrans transforms various third-person demonstrations into the fixed robot execution view. With a meta-learning strategy, FVTrans can quickly adapt to novel contexts with few samples. Then, RANet aligns the human and the robot at the feature level, so that the demonstration feature can be used as a subgoal for the current moment. Finally, IDNet predicts the joint angles of the robot. We collect a multi-context dataset on a real robot (UR5) for three tasks: grasping cups, sweeping garbage and placing objects. We empirically demonstrate that our framework performs the three tasks with a high success rate and generalizes effectively to different contexts.
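The data flow through the three networks can be sketched as follows. This is a minimal illustration of the ordering described in the abstract, not the authors' implementation. The class `ImitationPipeline`, its method `step`, and the toy stand-in functions are hypothetical names introduced here. The abstract does not specify the inputs of IDNet, so the sketch assumes it receives the current robot feature together with the demonstration subgoal feature, and that RANet encodes both the transformed demonstration frame and the robot observation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder types (assumptions for illustration only).
Image = List[List[float]]   # a camera frame as a 2-D grid of values
Feature = List[float]       # a feature vector
JointAngles = List[float]   # one value per robot joint


@dataclass
class ImitationPipeline:
    """Hypothetical wrapper that chains the three stages in order."""
    fvtrans: Callable[[Image], Image]                 # third-person frame -> robot-view frame
    ranet: Callable[[Image], Feature]                 # frame -> aligned feature
    idnet: Callable[[Feature, Feature], JointAngles]  # (current, subgoal) -> joint angles

    def step(self, demo_frame: Image, robot_frame: Image) -> JointAngles:
        # 1. Map the third-person demonstration frame into the robot's view.
        robot_view_demo = self.fvtrans(demo_frame)
        # 2. Encode both frames; the demonstration feature serves as the subgoal.
        subgoal = self.ranet(robot_view_demo)
        current = self.ranet(robot_frame)
        # 3. Predict joint angles from the current feature and the subgoal
        #    (assumed input signature; not stated in the abstract).
        return self.idnet(current, subgoal)


# Toy stand-ins so the sketch runs; real networks would replace these.
def toy_fvtrans(frame: Image) -> Image:
    return [list(reversed(row)) for row in frame]   # mirror each row


def toy_ranet(frame: Image) -> Feature:
    return [sum(row) / len(row) for row in frame]   # mean of each row


def toy_idnet(current: Feature, subgoal: Feature) -> JointAngles:
    return [g - c for c, g in zip(current, subgoal)]  # feature difference


pipeline = ImitationPipeline(toy_fvtrans, toy_ranet, toy_idnet)
demo = [[1.0, 3.0], [2.0, 4.0]]
robot = [[0.0, 2.0], [1.0, 1.0]]
angles = pipeline.step(demo, robot)
print(angles)
```

The toy functions only demonstrate the ordering of the stages. A real UR5 controller would output six joint values, and each stage would be a trained network rather than a hand-written function.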

Robot Learning from Human Demonstrations with Inconsistent Contexts

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Cross-context Visual Imitation Learning from Demonstrations

Vision-based Robot Manipulation Learning via Human Demonstrations

Learning Human-to-Robot Handovers from Point Clouds

Watch and Act: Learning Robotic Manipulation from Visual Demonstration

Zero-shot Imitation Learning from Demonstrations for Legged Robot Visual Navigation

Learning from demonstrations: An intuitive VR environment for imitation learning of construction robots

Generalized Robot Learning Framework

Interactive Visual Task Learning for Robots

Robotic Imitation of Human Actions

Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations

Visual Imitation Made Easy

One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning

Robot imitation from multimodal observation with unsupervised cross-modal representation

Learning Generalizable 3D Manipulation With 10 Demonstrations

An Object Attribute Guided Framework for Robot Learning Manipulations from Human Demonstration Videos

Vision-based Robotic Arm Imitation by Human Gesture

A Task-Learning Strategy for Robotic Assembly Tasks from Human Demonstrations

Contrast, Imitate, Adapt: Learning Robotic Skills From Raw Human Videos

A Human–Robot Collaboration Method Using a Pose Estimation Network for Robot Learning of Assembly Manipulation Trajectories From Demonstration Videos