Adaptive Multi-Goal Exploration

Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric
DOI: https://doi.org/10.48550/arXiv.2111.12045
2022-02-24
Abstract: We introduce a generic strategy for provably efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme that leverages a measure of uncertainty in reaching states to adaptively target goals that are neither too difficult nor too easy. We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy for the (initially unknown) set of goal states that are reachable within $L$ steps in expectation from a reference state $s_0$ in a reward-free Markov decision process. In the tabular case with $S$ states and $A$ actions, our algorithm requires $\tilde{O}(L^3 S A \epsilon^{-2})$ exploration steps, which is nearly minimax optimal. We also readily instantiate AdaGoal in linear mixture Markov decision processes, yielding the first goal-oriented PAC guarantee with linear function approximation. Beyond its strong theoretical guarantees, we anchor AdaGoal in goal-conditioned deep reinforcement learning, both conceptually and empirically, by connecting its idea of selecting "uncertain" goals to maximizing value ensemble disagreement.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to explore multi-goal environments efficiently and learn a near-optimal goal-conditioned policy in unsupervised goal-conditioned reinforcement learning (GC-RL). It proposes AdaGoal, a goal-selection scheme that adaptively selects goal states of intermediate difficulty, so that the agent learns to reach an initially unknown set of goal states without relying on external reward signals.

### Main contributions of the paper:

1. **Formalize the multi-goal exploration (MGE) objective**: Minimize the number of exploration steps (i.e., the sample complexity) needed to learn a goal-conditioned policy that is nearly optimal for all goal states reachable from the initial state \(s_0\) within \(L\) steps in expectation.
2. **Introduce AdaGoal**: A new goal-selection scheme that relies on a simple optimization problem and adaptively targets goal states of intermediate difficulty. It also provides a stopping rule for the algorithm and a set of candidate goal states that the agent is confident it can reliably reach.
3. **Design AdaGoal-UCBVI**: An instantiation of AdaGoal in tabular Markov decision processes, with a proof that its sample complexity is nearly minimax optimal.
4. **Design AdaGoal-UCRL·VTR**: An instantiation of AdaGoal in linear mixture Markov decision processes, yielding the first goal-oriented PAC guarantee with linear function approximation.
5. **Application in deep GC-RL**: By connecting the idea of selecting "uncertain" goals to maximizing value ensemble disagreement, the paper anchors AdaGoal in deep goal-conditioned reinforcement learning, both conceptually and empirically.
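The idea of targeting "uncertain" goals of intermediate difficulty can be illustrated with a small sketch. This is not the paper's algorithm: it assumes that each goal comes with an ensemble of estimates of the expected number of steps needed to reach it from \(s_0\), uses the ensemble mean as a difficulty estimate and the ensemble standard deviation as the disagreement, and picks the goal with the largest disagreement among those whose mean estimate is at most \(L\). The function name `select_goal`, the dictionary layout, and the numbers are hypothetical.

```python
from statistics import mean, pstdev


def select_goal(ensemble_estimates, L):
    """Return the goal with the largest ensemble disagreement among goals
    whose mean estimated distance from s_0 is at most L.

    ensemble_estimates: dict mapping goal -> list of distance estimates,
    one per ensemble member (hypothetical layout, for illustration only).
    Returns None if no goal satisfies the distance constraint.
    """
    best_goal, best_disagreement = None, -1.0
    for goal, estimates in ensemble_estimates.items():
        # Difficulty proxy: average estimated number of steps to the goal.
        if mean(estimates) > L:
            continue  # estimated to be too far away, skip it
        # Uncertainty proxy: spread of the ensemble's estimates.
        disagreement = pstdev(estimates)
        if disagreement > best_disagreement:
            best_goal, best_disagreement = goal, disagreement
    return best_goal


# Illustrative (made-up) estimates for three goals.
estimates = {
    "near": [2.0, 2.1, 1.9],      # easy: close, ensemble members agree
    "frontier": [6.0, 9.0, 4.5],  # intermediate: within L on average, members disagree
    "far": [30.0, 45.0, 38.0],    # too hard: mean estimate exceeds L
}
chosen = select_goal(estimates, L=10.0)
print(chosen)
```

In this toy example the selected goal is `"frontier"`: the easy goal is ruled out by its low disagreement and the distant goal by the distance constraint, which is one simple way to read "neither too difficult nor too easy".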
### Mathematical formulation of the core problem:

- **Definition 1**: For any policy \(\pi\) and pair of states \((s, s')\), \(V^\pi(s \to s')\) denotes the expected number of steps needed to reach \(s'\) from \(s\) when executing \(\pi\). \(V^\star(s \to s')\) denotes its minimum over policies.
- **Definition 2**: For any threshold \(L \geq 1\), a goal state \(g\) is said to be reliably \(L\)-reachable if \(V^\star(s_0 \to g) \leq L\). The set of all such goal states is denoted \(G_L\).
- **Definition 4**: A multi-goal exploration (MGE) algorithm is called \((\epsilon, \delta, L, G)\)-PAC if it stops in polynomial time and, with probability at least \(1 - \delta\), returns a set of goal states \(X\) and a set of policies \(\{\hat{\pi}_g\}_{g \in X}\) such that:
  - \( \forall g \in X, V^{\hat{\pi}_g}(s_0 \to g) - V^\star(s_0 \to g) \leq \epsilon \)
  - \( G_L \subseteq X \subseteq G_{L+\epsilon} \)

### Key assumptions and conclusions:

- **Assumption 3**: The action space contains a known reset action \(a_{\text{reset}}\) such that executing \(a_{\text{reset}}\) in any state \(s\) returns the agent to the initial state \(s_0\).
- **Lemma 5**: MGE can be solved in polynomial time, while MGE without a reset action requires exponential time.
- **Lemma 6**: For any \((\epsilon, \delta, L, G)\)-PAC MGE algorithm, there exist an MDP and a goal space such that the algorithm requires at least \(\Omega(L^3 SA \epsilon^{-2})\) steps to stop.
- **Theorem 8**: AdaGoal-UCBVI is \((\epsilon, \delta, L, S)\)-PAC, and its sample complexity is \(\til