METRA: Scalable Unsupervised RL with Metric-Aware Abstraction

Seohong Park, Oleh Rybkin, Sergey Levine
2024-03-10
Abstract: Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite the previous attempts, making unsupervised RL truly scalable still remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the entire state space, to only cover a compact latent space $Z$ that is metrically connected to the state space $S$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, being scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, being the first unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and videos are available at https://seohong.me/projects/metra/
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
The paper asks how to make unsupervised reinforcement learning (RL) scale to complex, high-dimensional environments, such as pixel-based ones. Existing unsupervised RL methods face two major challenges in such environments:

1. **Pure exploration methods**: These methods attempt to cover the entire state space or to fully capture the environment dynamics, which is usually infeasible when the state space is large. For example, in the 29-dimensional MuJoCo Ant environment, these methods cannot cover the entire state space.
2. **Unsupervised skill discovery methods**: These methods discover diverse, distinguishable behaviors by maximizing the mutual information (MI) between states and skills. However, they are often limited to simple, static behaviors. Because the MI objective gives no incentive to explore, their coverage of the state space is limited, and they are especially difficult to scale to pixel-based control environments.

To address these problems, the paper proposes a new unsupervised RL objective, called **Metric-Aware Abstraction (METRA)**. The main ideas of METRA are:

- Instead of directly covering the entire state space, cover a compact latent metric space \( Z \), which is connected to the state space \( S \) through a mapping function \( \phi: S \to Z \).
- Use **temporal distance** (i.e., the minimum number of environment steps between two states) as the metric that ties the latent space to the state space, rather than the traditional Euclidean distance. Temporal distance is invariant to state representations and is therefore suitable for pixel-based environments.

In this way, METRA can learn diverse and useful behaviors in complex, high-dimensional environments without having to cover every possible state.
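To make the notion of temporal distance concrete, here is a minimal Python sketch that computes it by breadth-first search on a small, hand-written transition graph. The graph `toy_mdp` and the function name `temporal_distance` are hypothetical illustrations, not part of the paper or its code; real environments have continuous states, where this quantity cannot be enumerated and is only used implicitly through the constraint on \( \phi \).

```python
from collections import deque


def temporal_distance(adjacency, start, goal):
    """Minimum number of environment steps needed to reach `goal` from `start`.

    `adjacency` maps each state to the states reachable in one step.
    Returns None if `goal` cannot be reached.
    """
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        state, steps = queue.popleft()
        for nxt in adjacency.get(state, ()):
            if nxt == goal:
                return steps + 1
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, steps + 1))
    return None


# Hypothetical toy MDP: a chain 0-1-2-3-4 plus a one-way shortcut 0 -> 3.
toy_mdp = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

forward = temporal_distance(toy_mdp, 0, 4)   # uses the shortcut
backward = temporal_distance(toy_mdp, 4, 0)  # must walk the whole chain
print(forward, backward)
```

In this toy graph the distance from state 0 to state 4 is 2 steps, while the reverse direction takes 4 steps, so the quantity depends only on the transition structure and not on how the states happen to be encoded.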
Experimental results show that METRA is the first unsupervised RL method to discover diverse locomotion behaviors in pixel-based Quadruped and Humanoid environments.

### Formula summary

- **Wasserstein dependence measure (WDM)**:
  \[ I_W(S;Z) = W\big(p(s, z),\, p(s)p(z)\big) \]
  where \( W \) is the 1-Wasserstein distance on the metric space \( (S \times Z, d) \), and \( d \) is the given distance metric.
- **Optimization objective**:
  \[ I_W(S;Z) \approx \sup_{\|\phi\|_L \leq 1} \mathbb{E}_{(s, z, s') \sim D}\left[ (\phi(s') - \phi(s))^\top z + \lambda \cdot \min\left(\epsilon,\, 1 - \|\phi(s) - \phi(s')\|_2^2\right) \right] \]
  where \( z \) is the skill vector, \( (s, s') \) is a transition sampled from the data \( D \), and the term weighted by the multiplier \( \lambda \) (with a small slack \( \epsilon \)) is a penalty that enforces the constraint below in practice.
- **Constraint condition**:
  \[ \|\phi(s) - \phi(s')\|_2 \leq 1, \quad \forall (s, s') \in S_{adj} \]
  where \( S_{adj} \) represents the set of adjacent state pairs in the MDP. Bounding the latent distance of every one-step transition by 1 is what makes \( \phi \) 1-Lipschitz with respect to temporal distance.

Through this method, METRA can perform unsupervised learning in complex environments and provide useful skills for downstream tasks.
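The bracketed term in the optimization objective above can be evaluated on a batch of transitions with a few lines of NumPy. The sketch below is only an illustration of that formula, not the authors' implementation: the function name `metra_phi_objective`, the array shapes, and the example values of `lam` and `eps` are assumptions, and the encoder outputs are passed in as plain arrays rather than produced by a neural network.

```python
import numpy as np


def metra_phi_objective(phi_s, phi_next, z, lam, eps=1e-3):
    """Batch estimate of E[(phi(s') - phi(s))^T z + lam * min(eps, 1 - ||phi(s) - phi(s')||^2)].

    phi_s, phi_next: arrays of shape (batch, latent_dim), i.e. phi(s) and phi(s').
    z:               array of shape (batch, latent_dim), the skill vectors.
    lam, eps:        penalty multiplier and slack (values here are assumptions).
    """
    delta = phi_next - phi_s
    # Alignment of the latent displacement with the skill direction.
    alignment = np.sum(delta * z, axis=1)
    # Penalty term: becomes negative once a one-step move exceeds unit latent length.
    sq_dist = np.sum(delta ** 2, axis=1)
    penalty = np.minimum(eps, 1.0 - sq_dist)
    return float(np.mean(alignment + lam * penalty))


# Hypothetical single-transition batches in a 2-D latent space.
z = np.array([[1.0, 0.0]])
phi_s = np.array([[0.0, 0.0]])
short_step = metra_phi_objective(phi_s, np.array([[0.5, 0.0]]), z, lam=30.0)
long_step = metra_phi_objective(phi_s, np.array([[2.0, 0.0]]), z, lam=30.0)
print(short_step, long_step)
```

A step of latent length 0.5 satisfies the constraint, so the penalty is capped at the small slack `eps`; a step of length 2 violates it and the objective is pushed strongly negative. That is how the penalty discourages \( \phi \) from inflating latent distances to increase the alignment term.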