Abstract: Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite the previous attempts, making unsupervised RL truly scalable still remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the entire state space, to only cover a compact latent space $Z$ that is metrically connected to the state space $S$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, being scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, being the first unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and videos are available at https://seohong.me/projects/metra/

What problem does this paper attempt to address?

The paper asks how to make unsupervised reinforcement learning (RL) scale to complex, high-dimensional environments, such as pixel-based ones. Existing unsupervised RL methods face two main challenges in such environments:

1. **Pure exploration methods**: These methods attempt to cover the entire state space or fully capture the environment dynamics, which is usually infeasible when the state space is large. For example, even in the 29-dimensional MuJoCo Ant environment, these methods cannot cover the entire state space.
2. **Unsupervised skill discovery methods**: These methods discover diverse, distinguishable behaviors by maximizing the mutual information (MI) between states and skills. However, the MI objective gives no incentive to explore, so these methods often learn only simple, static behaviors with limited state coverage, and they are difficult to scale directly to pixel-based control environments.

To address these problems, the paper proposes a new unsupervised RL objective called **Metric-Aware Abstraction (METRA)**. Its main ideas are:

- Instead of covering the entire state space directly, cover a compact latent metric space $Z$ that is connected to the state space $S$ through a mapping function $\phi: S \to Z$.
- Use **temporal distance** (i.e., the minimum number of environment steps between two states) as the metric that connects the latent space to the state space, rather than the traditional Euclidean distance. Temporal distance is invariant to the state representation and is therefore suitable for pixel-based environments.

In this way, METRA can learn diverse, useful behaviors in complex, high-dimensional environments without having to cover every possible state.
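The definition of temporal distance above can be illustrated with a minimal sketch. It assumes a small, deterministic MDP given as an adjacency map, where breadth-first search yields the minimum number of steps. The names `temporal_distance`, `relabel`, and `TOY_MDP` are hypothetical and do not come from the paper or its code.

```python
from collections import deque


def temporal_distance(adjacency, start, goal):
    """Return the minimum number of environment steps from start to goal.

    Uses breadth-first search over a deterministic transition graph.
    Returns None when goal is unreachable from start.
    """
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        state, steps = queue.popleft()
        for nxt in adjacency.get(state, ()):
            if nxt == goal:
                return steps + 1
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, steps + 1))
    return None


def relabel(adjacency, rename):
    """Re-encode every state with `rename`, leaving the transitions unchanged."""
    return {rename(s): [rename(n) for n in nbrs] for s, nbrs in adjacency.items()}


# Hypothetical toy MDP: a chain 0 -> 1 -> 2 -> 3 -> 4 with a shortcut 1 -> 3.
TOY_MDP = {0: [1], 1: [2, 3], 2: [3], 3: [4], 4: []}

# The same dynamics under a different state encoding (illustrative only).
RELABELED_MDP = relabel(TOY_MDP, lambda s: f"obs_{s}")

print(temporal_distance(TOY_MDP, 0, 4))
print(temporal_distance(RELABELED_MDP, "obs_0", "obs_4"))
```

Both calls report the same distance, because the result depends only on the transition structure and not on how states are encoded. This is the sense in which temporal distance is invariant to the state representation; Euclidean distance between raw observations has no such guarantee.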
Experiments show that METRA is the first unsupervised RL method to discover diverse locomotion behaviors in pixel-based Quadruped and Humanoid environments.

### Formula summary

- **Wasserstein dependence measure (WDM)**: \[ I_W(S;Z) = W(p(s, z), p(s)p(z)) \] where $W$ is the 1-Wasserstein distance on the metric space $(S \times Z, d)$, and $d$ is the given distance metric.
- **Optimization objective**: \[ I_W(S;Z) \approx \sup_{\|\phi\|_L \leq 1} \mathbb{E}_{(s,z,s') \sim D}[(\phi(s') - \phi(s))^\top z] \] In practice, the Lipschitz constraint is relaxed with a Lagrange multiplier $\lambda$ and a small slack $\epsilon$, so that $\phi$ is trained to maximize \[ \mathbb{E}_{(s,z,s') \sim D}[(\phi(s') - \phi(s))^\top z + \lambda \cdot \min(\epsilon, 1 - \|\phi(s) - \phi(s')\|_2^2)] \]
- **Constraint condition**: With temporal distance as the metric, the Lipschitz constraint $\|\phi\|_L \leq 1$ takes the form \[ \|\phi(s) - \phi(s')\|_2 \leq 1, \quad \forall (s, s') \in S_{adj} \] where $S_{adj}$ is the set of adjacent state pairs in the MDP.

With this objective, METRA can perform unsupervised learning in complex environments and provide useful skills for downstream tasks.
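The term inside the expectation of the optimization objective above can be estimated on a batch of transitions. The following is a minimal sketch in plain Python, assuming the latent vectors $\phi(s)$ and $\phi(s')$ have already been computed. The function names and the values of `lam` and `eps` are illustrative assumptions, not the authors' implementation or settings.

```python
def phi_objective(phi_s, phi_next, z, lam, eps):
    """Batch average of (phi(s') - phi(s))^T z + lam * min(eps, 1 - ||phi(s) - phi(s')||^2).

    phi_s, phi_next, z: equal-length lists of equal-dimension vectors,
    one entry per sampled (s, z, s') transition.
    """
    total = 0.0
    for a, b, skill in zip(phi_s, phi_next, z):
        diff = [bj - aj for aj, bj in zip(a, b)]
        # Reward for moving in latent space along the skill direction z.
        alignment = sum(d * zj for d, zj in zip(diff, skill))
        # Slack is capped at eps when the constraint holds, negative when it is violated.
        slack = min(eps, 1.0 - sum(d * d for d in diff))
        total += alignment + lam * slack
    return total / len(phi_s)


def satisfies_constraint(phi_s, phi_next, tol=1e-9):
    """Check ||phi(s) - phi(s')||_2 <= 1 for every adjacent pair in the batch."""
    for a, b in zip(phi_s, phi_next):
        sq_norm = sum((bj - aj) ** 2 for aj, bj in zip(a, b))
        if sq_norm > 1.0 + tol:
            return False
    return True


# Illustrative batch of two transitions in a 2-D latent space.
phi_s = [[0.0, 0.0], [0.0, 0.0]]
z = [[1.0, 0.0], [0.0, 1.0]]
feasible_next = [[0.6, 0.0], [0.0, 0.8]]   # both latent steps have norm <= 1
violating_next = [[0.6, 0.0], [0.0, 2.0]]  # second latent step has norm 2

print(phi_objective(phi_s, feasible_next, z, lam=30.0, eps=1e-3))
print(phi_objective(phi_s, violating_next, z, lam=30.0, eps=1e-3))
```

In this sketch, a latent step longer than 1 makes the slack term strongly negative, so the penalty outweighs the larger alignment reward. Capping the slack at `eps` means transitions that already satisfy the constraint gain nothing from shrinking their latent step further.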

METRA: Scalable Unsupervised RL with Metric-Aware Abstraction
