Abstract: What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks. This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics. Surprisingly, despite methods aiming to maximise regret in theory, the practical approximations do not correlate with regret but with success rate. As a result, a significant portion of an agent's experience comes from environments it has already mastered, offering little to no contribution toward enhancing its abilities. Put differently, current methods fail to predict intuitive measures of "learnability." Specifically, they are unable to consistently identify those scenarios that the agent can sometimes solve, but not always. Based on our analysis, we develop a method that directly trains on scenarios with high learnability. This simple and intuitive approach outperforms existing UED methods in several binary-outcome environments, including the standard domain of Minigrid and a novel setting closely inspired by a real-world robotics problem. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability.

What problem does this paper attempt to address?

The problem this paper addresses is how to select training environments for reinforcement learning (RL) agents so as to improve their performance on downstream tasks. Specifically, the authors examine how Unsupervised Environment Design (UED) methods select training environments, focusing on the task prioritisation metrics these methods rely on.

### Problem Background

1. **Automated Curriculum Discovery**: Automatically generating training environments for RL agents is a long-standing and active research area. Automated curriculum learning (ACL) methods can generate diverse environments and thereby produce more general and robust agents.
2. **Unsupervised Environment Design (UED)**: In recent years, a class of UED-based methods has attracted attention for its theoretical robustness guarantees and empirical generalization ability. These methods aim to maximize "regret", that is, the performance difference between the optimal policy and the current policy.

### Shortcomings of Existing Methods

Although maximizing regret is theoretically well motivated, computing regret exactly is infeasible in practice, so approximations must be used. The existing approximations, however, do not correlate with regret; they correlate with success rate instead. As a result, a large share of training experience comes from environments the agent has already mastered, which contributes little to improving its capabilities.

### Core Problems of the Paper

- **Problems with the Selection Mechanism of Existing UED Methods**: The scoring functions that current UED methods use to select training environments, such as MaxMC (Maximum Monte Carlo Score) and PVL (Positive Value Loss), do not reliably predict "learnability", that is, they fail to consistently identify tasks that the agent can sometimes solve but not always.
- **Improving the Environment Selection Mechanism**: To overcome this problem, the authors propose a new method, Sampling For Learnability (SFL), which directly selects environments with high learnability for training.

### Solutions

1. **Defining Learnability**: The authors define learnability as the product of the success probability \(p\) and the failure probability \(1 - p\), that is, \(p\cdot(1 - p)\). The score is zero for environments the agent always solves or never solves, and largest for environments it solves about half the time.
2. **Proposing the SFL Method**: SFL randomly samples environments and trains on those that the agent can sometimes solve but not always. Experimental results show that this method outperforms existing UED methods in several binary-outcome environments, including Minigrid and a setting inspired by a real-world robotics problem.
3. **Introducing a New Evaluation Protocol**: To evaluate the robustness of ACL methods more rigorously, the authors introduce a new adversarial evaluation protocol modelled on Conditional Value-at-Risk (CVaR), which directly measures the agent's performance in the worst-case scenarios.

### Summary

This paper shows that the environment selection used by existing UED methods tracks success rate rather than regret, and proposes SFL, a simple and intuitive method that selects environments by learnability instead. Experimental results show that SFL outperforms existing methods in several environments and trains more robust RL agents.
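The learnability score and the worst-case evaluation described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the top-`k` selection rule, and the choice to average success rates over the worst `alpha` fraction of levels are all assumptions made for illustration. Only the score \(p\cdot(1 - p)\) and the idea of measuring worst-case performance come from the text.

```python
import math
from typing import Dict, Hashable, List, Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Empirical success probability p from binary rollout outcomes."""
    if not outcomes:
        raise ValueError("need at least one rollout outcome")
    return sum(outcomes) / len(outcomes)


def learnability(p: float) -> float:
    """Learnability score p * (1 - p).

    It is 0 when the level is always solved (p = 1) or never solved (p = 0)
    and peaks at 0.25 when the level is solved half the time (p = 0.5).
    """
    return p * (1.0 - p)


def select_learnable_levels(rates: Dict[Hashable, float], k: int) -> List[Hashable]:
    """Return the k levels with the highest learnability.

    ASSUMPTION: a plain top-k rule; the source only says that SFL samples
    levels at random and trains on those that are sometimes solved.
    """
    ranked = sorted(rates, key=lambda level: learnability(rates[level]), reverse=True)
    return ranked[:k]


def worst_case_success(rates: Sequence[float], alpha: float) -> float:
    """Mean success rate over the worst alpha fraction of levels.

    ASSUMPTION: a CVaR-style summary chosen for illustration; the paper's
    exact evaluation protocol is not given in the text above.
    """
    if not rates:
        raise ValueError("need at least one level")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    n_worst = max(1, math.ceil(alpha * len(rates)))
    worst = sorted(rates)[:n_worst]
    return sum(worst) / n_worst
```

With hypothetical success rates `{"a": 1.0, "b": 0.5, "c": 0.0, "d": 0.9}`, `select_learnable_levels(rates, 2)` keeps `"b"` and `"d"`: the fully mastered level `"a"` and the never-solved level `"c"` both score zero, which is exactly the experience a success-rate-driven curriculum would spend time on.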

No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Evolving Curricula with Regret-Based Environment Design

Refining Minimax Regret for Unsupervised Environment Design

Adversarial Environment Design via Regret-Guided Diffusion Models

Learning Curricula in Open-Ended Worlds

Learning not to Regret

Not All Errors Are Made Equal: A Regret Metric for Detecting System-level Trajectory Prediction Failures

Do LLM Agents Have Regret? A Case Study in Online Learning and Games

Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Regret Minimization Experience Replay in Off-Policy Reinforcement Learning

Combining Counterfactual Regret Minimization with Information Gain to Solve Extensive Games with Unknown Environments

Data-Driven Online Model Selection With Regret Guarantees

Regret Minimization for Partially Observable Deep Reinforcement Learning

The Advantage Regret-Matching Actor-Critic

Stabilizing Unsupervised Environment Design with a Learned Adversary

Adversarial Environment Generation for Learning to Navigate the Web

Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

Reward-Free Curricula for Training Robust World Models

Curiosity & Entropy Driven Unsupervised RL in Multiple Environments