Abstract: In this paper, we study the problem of learning a repertoire of low-level skills from raw images that can be sequenced to complete long-horizon visuomotor tasks. Reinforcement learning (RL) is a promising approach for acquiring short-horizon skills autonomously. However, RL algorithms have largely focused on the success of those individual skills rather than on learning and grounding a large repertoire of skills that can be sequenced to complete extended multi-stage tasks. The latter demands robustness and persistence, as errors in skills can compound over time, and may require the robot to have a number of primitive skills in its repertoire, rather than just one. To this end, we introduce EMBER, a model-based RL method for learning primitive skills that are suitable for completing long-horizon visuomotor tasks. EMBER learns and plans using a learned model, critic, and success classifier, where the success classifier serves both as a reward function for RL and as a grounding mechanism to continuously detect whether the robot should retry a skill when it is unsuccessful or perturbed. Further, the learned model is task-agnostic and trained using data from all skills, enabling the robot to efficiently learn a number of distinct primitives. These visuomotor primitive skills and their associated pre- and post-conditions can then be directly combined with off-the-shelf symbolic planners to complete long-horizon tasks. On a Franka Emika robot arm, we find that EMBER enables the robot to complete three long-horizon visuomotor tasks (organizing an office desk, a file cabinet, and drawers) at an 85% success rate. These tasks require sequencing up to 12 skills, involve 14 unique learned primitives, and demand generalization to novel objects.

One-Shot Robust Imitation Learning for Long-Horizon Visuomotor Tasks from Unsegmented Demonstrations

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

One-Shot Visual Imitation Learning via Meta-Learning

One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning

One-Shot Hierarchical Imitation Learning of Compound Visuomotor Tasks

MimicPlay: Long-Horizon Imitation Learning by Watching Human Play

One-Shot Imitation Learning with Invariance Matching for Robotic Manipulation

One-shot Imitation in a Non-Stationary Environment via Multi-Modal Skill

Sequential robot imitation learning from observations

Efficient Robot Skill Learning with Imitation from a Single Video for Contact-Rich Fabric Manipulation

Vision-Based One-Shot Imitation Learning Supplemented with Target Recognition via Meta Learning

Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations

One-Shot Imitation under Mismatched Execution

Learning One-Shot Imitation From Humans Without Humans

Curriculum-Based Imitation of Versatile Skills

K-VIL: Keypoints-based Visual Imitation Learning

One-shot Imitation Learning via Interaction Warping

One-Shot Domain-Adaptive Imitation Learning via Progressive Learning

Robot Learning from Human Demonstrations with Inconsistent Contexts

Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks

Single-Shot Learning of Stable Dynamical Systems for Long-Horizon Manipulation Tasks