Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning

Tongzhou Wang, Antonio Torralba, Phillip Isola, Amy Zhang
2023-11-27
Abstract: In goal-reaching reinforcement learning (RL), the optimal value function has a particular geometry, called quasimetric structure. This paper introduces Quasimetric Reinforcement Learning (QRL), a new RL method that utilizes quasimetric models to learn optimal value functions. Distinct from prior approaches, the QRL objective is specifically designed for quasimetrics, and provides strong theoretical recovery guarantees. Empirically, we conduct thorough analyses on a discretized MountainCar environment, identifying properties of QRL and its advantages over alternatives. On offline and online goal-reaching benchmarks, QRL also demonstrates improved sample efficiency and performance, across both state-based and image-based observations.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to exploit quasimetric structure to learn the optimal value function more effectively in goal-reaching reinforcement learning (RL). Specifically, it proposes Quasimetric Reinforcement Learning (QRL), a new RL method that learns the optimal value function through a quasimetric model, aiming to improve sample efficiency and performance.

### Main problems

1. **Differences between single-task and multi-task RL**:
   - In single-task RL, the value function can be an arbitrary function with no specific structure.
   - In multi-task (goal-reaching) RL, the optimal goal-conditioned value function \( V^*(s; g) \) has quasimetric structure: the negated value \( -V^*(s; g) \), the optimal cost of reaching \( g \) from \( s \), satisfies the triangle inequality but need not be symmetric.
2. **Applications of the quasimetric model**:
   - Quasimetric models can capture complex dynamics, including asymmetric ones, which traditional symmetric metric models cannot.
   - Optimizing the quasimetric model to maximize the separation between states while preserving local distances allows the optimal value function to be learned accurately.
3. **Specific goals of QRL**:
   - **Local constraint**: Ensure that the quasimetric model \( d_\theta \) does not overestimate the local cost, that is, for each transition \( (s, a, s', r) \), \( d_\theta(s, s') \leq -r \).
   - **Global constraint**: Since \( d_\theta \) is a quasimetric and satisfies the triangle inequality, for each state \( s \) and goal \( g \), any path connecting \( s \) to \( g \) imposes a constraint on \( d_\theta(s, g) \), that is, \( d_\theta(s, g) \leq \) the total cost of the path.

### Solutions

- **QRL framework**:
  - Use the quasimetric model \( d_\theta \) to parameterize the negated goal-conditioned optimal value function \( -V^* \).
  - Learn \( d_\theta \) by optimizing an objective that enforces the local constraint directly; the global constraint then follows from the triangle inequality.
  - The objective is: \[ \max_\theta \; \mathbb{E}_{s\sim p_{\text{state}},\, g\sim p_{\text{goal}}}\left[ d_\theta(s, g) \right] \quad \text{subject to} \quad \mathbb{E}_{(s, a, s', r)\sim p_{\text{transition}}}\left[ \text{relu}(d_\theta(s, s') + r)^2 \right] \leq \epsilon^2 \] where \( \epsilon > 0 \) is a small constant, and \( \text{relu}(x) = \max(x, 0) \) penalizes only the cases where \( d_\theta(s, s') \) exceeds the transition cost \( -r \).
- **Theoretical guarantee**:
  - The paper provides theoretical recovery guarantees showing that QRL can learn the optimal value function for a specific class of MDPs.
- **Experimental verification**:
  - On offline and online goal-reaching benchmarks, QRL shows improved sample efficiency and performance, across both state-based and image-based observations.

### Summary

This paper tackles value function learning in multi-task goal-reaching RL by introducing a quasimetric model together with the QRL objective, improving learning efficiency and performance.
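To make the local and global constraints concrete, the following is a minimal sketch on a hypothetical toy problem, not the paper's implementation. It assumes a four-state one-way ring with reward \( r = -1 \) per transition, uniform \( p_{\text{state}} \) and \( p_{\text{goal}} \), and a tabular distance in place of a learned \( d_\theta \). All names (`TRANSITIONS`, `shortest_path_costs`, `constraint_violation`, and so on) are illustrative. The sketch computes the shortest-path cost, checks that it is an asymmetric quasimetric satisfying the local constraint, and compares its objective value against a shrunken and an inflated alternative.

```python
import itertools
import math

# Hypothetical toy MDP (an assumption for illustration): a one-way ring
# 0 -> 1 -> 2 -> 3 -> 0, each transition given as (s, s_next, r) with r = -1.
TRANSITIONS = [(0, 1, -1.0), (1, 2, -1.0), (2, 3, -1.0), (3, 0, -1.0)]
NUM_STATES = 4


def shortest_path_costs(transitions, n):
    """Floyd-Warshall: d[s][g] is the minimal total cost from s to g."""
    d = [[0.0 if s == g else math.inf for g in range(n)] for s in range(n)]
    for s, s_next, r in transitions:
        d[s][s_next] = min(d[s][s_next], -r)  # local cost is -r
    # product() varies its first element slowest, so k is the outer loop.
    for k, i, j in itertools.product(range(n), repeat=3):
        d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d


def is_quasimetric(d):
    """Check d(x, x) = 0, non-negativity, and the triangle inequality."""
    n = len(d)
    identity = all(d[i][i] == 0.0 for i in range(n))
    nonneg = all(d[i][j] >= 0.0 for i in range(n) for j in range(n))
    triangle = all(
        d[i][j] <= d[i][k] + d[k][j]
        for i, k, j in itertools.product(range(n), repeat=3)
    )
    return identity and nonneg and triangle


def constraint_violation(d, transitions):
    """Mean of relu(d(s, s') + r)^2 over transitions (the constrained term)."""
    total = sum(max(d[s][s2] + r, 0.0) ** 2 for s, s2, r in transitions)
    return total / len(transitions)


def objective(d):
    """Mean of d(s, g) over all (state, goal) pairs (uniform sampling assumed)."""
    n = len(d)
    return sum(d[s][g] for s in range(n) for g in range(n)) / (n * n)


def scaled(d, c):
    """Return a copy of d with every entry multiplied by c."""
    return [[c * x for x in row] for row in d]


d_star = shortest_path_costs(TRANSITIONS, NUM_STATES)
shrunk = scaled(d_star, 0.5)    # feasible, but states are less separated
inflated = scaled(d_star, 2.0)  # more separated, but overestimates local cost

print(d_star[0][3], d_star[3][0])                    # 3.0 1.0 (asymmetric)
print(is_quasimetric(d_star))                        # True
print(constraint_violation(d_star, TRANSITIONS))     # 0.0
print(objective(shrunk), objective(d_star))          # 0.75 1.5
print(constraint_violation(inflated, TRANSITIONS))   # 1.0
```

In this toy case the shortest-path cost satisfies the local constraint exactly, the shrunken quasimetric is feasible but has a lower objective, and the inflated one is still a quasimetric but violates the local constraint. This matches the intuition behind the objective: push states apart as far as the local costs allow.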