No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Alexander Rutherford, Michael Beukman, Timon Willi, Bruno Lacerda, Nick Hawes, Jakob Foerster
2024-10-30
Abstract: What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks. This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics. Surprisingly, despite methods aiming to maximise regret in theory, the practical approximations do not correlate with regret but with success rate. As a result, a significant portion of an agent's experience comes from environments it has already mastered, offering little to no contribution toward enhancing its abilities. Put differently, current methods fail to predict intuitive measures of "learnability." Specifically, they are unable to consistently identify those scenarios that the agent can sometimes solve, but not always. Based on our analysis, we develop a method that directly trains on scenarios with high learnability. This simple and intuitive approach outperforms existing UED methods in several binary-outcome environments, including the standard domain of Minigrid and a novel setting closely inspired by a real-world robotics problem. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper addresses how to select training environments for reinforcement learning (RL) agents so that their performance on downstream tasks improves. Specifically, the authors examine how Unsupervised Environment Design (UED) methods select training environments, focusing on the task-prioritization metrics these methods use.

### Problem Background

1. **Automated Curriculum Discovery**: Automatically generating training environments for RL agents is a long-standing and active research area. Automated curriculum learning (ACL) methods can generate diverse environments and thereby produce more general-purpose and robust agents.
2. **Unsupervised Environment Design (UED)**: In recent years, UED methods have attracted attention for their theoretical robustness and empirical generalization ability. These methods aim to maximize "regret", that is, the performance gap between the optimal policy and the current policy.

### Shortcomings of Existing Methods

Maximizing regret is attractive in theory, but computing regret is generally infeasible in practice, so approximations must be used. The existing approximations do not correlate with regret; they correlate with success rate instead. As a result, a significant portion of the agent's training experience comes from environments it has already mastered, which contributes little to improving its capabilities.

### Core Problems of the Paper

- **Problems with the Selection Mechanism of Existing UED Methods**: The scoring functions that current UED methods use to select training environments, such as MaxMC (Maximum Monte Carlo) and PVL (Positive Value Loss), do not reliably predict "learnability". That is, they fail to consistently identify tasks that the agent can sometimes solve but not always.
- **Improving the Environment Selection Mechanism**: To overcome this problem, the authors propose a new method, Sampling For Learnability (SFL), which directly selects environments with high learnability for training.

### Solutions

1. **Defining Learnability**: The authors define learnability as the product of the success probability \(p\) and the failure probability \(1 - p\), that is, \(p\cdot(1 - p)\). The score is zero when the agent always succeeds (\(p = 1\)) or always fails (\(p = 0\)) and peaks at \(p = 0.5\), so it favors environments that the agent has neither fully mastered nor finds impossible.
2. **Proposing the SFL Method**: SFL randomly samples environments and selects for training those that the agent can sometimes solve but not always. According to the paper, this method outperforms existing UED methods in several binary-outcome environments, including Minigrid and a setting inspired by a real-world robotics problem.
3. **Introducing a New Evaluation Protocol**: To evaluate the robustness of ACL methods more rigorously, the authors introduce an adversarial evaluation protocol that closely mirrors Conditional Value at Risk (CVaR) and directly measures the agent's performance on its worst-case environments.

### Summary

This paper shows that the scoring functions existing UED methods use to select training environments track success rate rather than regret, and it proposes SFL, a simple and intuitive method that selects environments by learnability instead. The reported experiments show SFL outperforming existing UED methods in several environments and producing more robust RL agents.
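The learnability score and the selection step described under Solutions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`learnability`, `select_learnable_levels`, `empirical_cvar`) are hypothetical, the top-`k` selection rule is an assumption about how "high learnability" environments might be picked from a randomly sampled batch, and `empirical_cvar` is a generic tail-average estimator rather than the paper's exact evaluation protocol. Success rates are assumed to have been estimated beforehand from rollouts of the current policy.

```python
import math
from typing import List, Sequence


def learnability(success_rate: float) -> float:
    """Learnability score p * (1 - p) for a success probability p in [0, 1]."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in [0, 1]")
    return success_rate * (1.0 - success_rate)


def select_learnable_levels(success_rates: Sequence[float], k: int) -> List[int]:
    """Return indices of the k levels with the highest learnability, best first.

    Assumption: levels were sampled at random and their success rates were
    estimated from rollouts; keeping the top k is one simple selection rule.
    """
    ranked = sorted(
        range(len(success_rates)),
        key=lambda i: learnability(success_rates[i]),
        reverse=True,
    )
    return ranked[:k]


def empirical_cvar(outcomes: Sequence[float], alpha: float) -> float:
    """Mean of the worst alpha-fraction of outcomes (lower outcome = worse).

    Generic estimator used for illustration only, not the paper's protocol.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n_tail = max(1, math.ceil(alpha * len(outcomes)))
    worst = sorted(outcomes)[:n_tail]
    return sum(worst) / n_tail


# Example: per-level success rates of a hypothetical agent.
rates = [0.0, 0.2, 0.5, 0.9, 1.0]
print(select_learnable_levels(rates, k=2))         # -> [2, 1]
print(empirical_cvar([1.0, 0.0, 0.5, 1.0], 0.5))   # -> 0.25
```

In the example, the mastered level (success rate 1.0) and the unsolved level (0.0) both score zero and are never selected, which is the behavior that the success-rate-correlated scores criticized above fail to provide.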