Abstract: Non-uniform goal selection has the potential to improve the reinforcement learning (RL) of skills over uniform-random selection. In this paper, we introduce a method for learning a goal-selection policy in intrinsically-motivated goal-conditioned RL: "Diversity Progress" (DP). The learner forms a curriculum based on observed improvement in discriminability over its set of goals. Our proposed method is applicable to the class of discriminability-motivated agents, where the intrinsic reward is computed as a function of the agent's certainty of following the true goal being pursued. This reward can motivate the agent to learn a set of diverse skills without extrinsic rewards. We demonstrate empirically that a DP-motivated agent can learn a set of distinguishable skills faster than previous approaches, and do so without suffering from a collapse of the goal distribution -- a known issue with some prior approaches. We end with plans to take this proof-of-concept forward.

What problem does this paper attempt to address?

The paper asks how goals should be selected in discriminability-motivated reinforcement learning (RL) so that an agent learns multiple distinct skills more efficiently. It proposes a method for learning a goal-selection policy, called "Diversity Progress" (DP), which speeds up the learning of diverse skills by preferentially selecting the goals that yield the largest improvement in discriminability. Compared with uniform-random goal selection, DP learns a set of distinguishable skills in less time, and it avoids the goal-distribution collapse seen in some prior approaches.

### Core contributions of the paper:

1. **Proposing the Diversity Progress (DP) method**: DP forms a curriculum based on the observed improvement in discriminability over the goal set. It applies to discriminability-based intrinsically-motivated agents, whose intrinsic reward is computed from the agent's certainty about the true goal being pursued.
2. **Empirical research**: The experiments show that DP-motivated agents learn a set of distinguishable skills faster than previous methods and do not suffer from goal-distribution collapse.
3. **Future work plans**: The authors outline plans to develop this proof-of-concept further, including testing other intrinsic rewards and evaluating performance in different environments.

### Specific problems solved:

- **Goal-distribution collapse**: Some previous methods (such as VIC) gradually concentrate on only a few skills during training, which reduces the number of effective skills. DP avoids this collapse by dynamically adjusting the goal-selection probabilities.
- **Low learning efficiency**: Baseline methods typically select goals uniformly at random, which can slow learning. DP improves efficiency by preferentially selecting the goals that bring the most discriminability progress.
- **Lack of diverse skills**: In multi-skill learning, ensuring that the learned skills are diverse is a key challenge. DP promotes more diverse skill learning by maximizing the discriminability between goals.

### Formula summary:

- **Discriminability objective function**:
  \[ I(g; f(T_{\pi_g})) := H(g) - H(g | f(T_{\pi_g})) \]
  where \(H\) represents Shannon entropy.
- **Variational lower bound**:
  \[ I(g; f(T_{\pi_g})) \geq \tilde{I}(g; f(T_{\pi_g})) := H(g) + E_{g \sim p(g), T_{\pi_g} \sim \pi(g)} \left[ \log q(g | f(T_{\pi_g})) \right] \]
  where \(q\) is a variational approximation to the posterior over goals.
- **Learning progress (LP)**:
  \[ LP_n(t + 1) := e_n(t + 1 - \tau) - e_n(t + 1) \]
- **Diversity progress (DP)**:
  \[ DP(t + 1) := \frac{1}{|G|} \sum_{g \in G} \left( e_g(t + 1 - \tau) - e_g(t + 1) \right) \]

Through these formulas and methods, the paper demonstrates the potential of DP in promoting multi-skill learning and provides directions for future improvements.
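The learning-progress and diversity-progress formulas above can be illustrated with a small NumPy sketch. It assumes that \(e_g(t)\) is a per-goal discriminator error recorded once per time step and that \(\tau\) is an integer look-back window; neither is defined in this summary. The function `goal_probabilities` is a hypothetical selection rule added for illustration only (progress-proportional sampling with a uniform floor `eps`), not the paper's goal-selection policy.

```python
import numpy as np


def learning_progress(errors, tau):
    """Per-goal progress: e_g(t + 1 - tau) - e_g(t + 1).

    errors: array-like of shape (T, |G|); row i holds the (assumed)
    discriminator error of every goal at step i, last row = time t + 1.
    """
    errors = np.asarray(errors, dtype=float)
    if errors.shape[0] <= tau:
        raise ValueError("need more than tau rows of error history")
    return errors[-1 - tau] - errors[-1]


def diversity_progress(errors, tau):
    """DP(t + 1): learning progress averaged over the goal set G."""
    return float(np.mean(learning_progress(errors, tau)))


def goal_probabilities(errors, tau, eps=0.1):
    """Hypothetical rule (an assumption, not from the paper): sample goals
    in proportion to their positive progress, mixed with a uniform floor
    so that no goal's probability drops to zero."""
    lp = np.clip(learning_progress(errors, tau), 0.0, None)
    uniform = np.full(lp.size, 1.0 / lp.size)
    if lp.sum() == 0.0:
        # No goal is improving: fall back to uniform-random selection.
        return uniform
    return (1.0 - eps) * lp / lp.sum() + eps * uniform


if __name__ == "__main__":
    # Three goals, three recorded steps of made-up error values.
    history = [[0.9, 0.9, 0.9],
               [0.8, 0.9, 0.7],
               [0.5, 0.9, 0.6]]
    print(learning_progress(history, tau=2))
    print(diversity_progress(history, tau=2))
    print(goal_probabilities(history, tau=2))
```

In this sketch the uniform floor is one simple way to keep every goal reachable, which is the property the summary associates with avoiding goal-distribution collapse; the paper's own mechanism may differ.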

Diversity Progress for Goal Selection in Discriminability-Motivated RL

Diversify & Conquer: Outcome-directed Curriculum RL via Out-of-Distribution Disagreement

Density-based Curriculum for Multi-goal Reinforcement Learning with Sparse Rewards

Iteratively Learning Novel Strategies with Diversity Measured in State Distances

Controlled Diversity with Preference : Towards Learning a Diverse Set of Desired Skills

Iteratively Learn Diverse Strategies with State Distance Information

Learning Diverse Policies with Soft Self-Generated Guidance

Non-local Policy Optimization via Diversity-regularized Collaborative Exploration

Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Enhanced Generalization through Prioritization and Diversity in Self-Imitation Reinforcement Learning over Procedural Environments with Sparse Rewards

Quality-Similar Diversity via Population Based Reinforcement Learning

Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts

DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Learning to Reach Goals via Diffusion

Maximum Entropy-Regularized Multi-Goal Reinforcement Learning

Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

TendencyRL: Multi-stage Discriminative Hints for Efficient Goal-Oriented Reverse Curriculum Learning.

Diversity Through Exclusion (DTE): Niche Identification for Reinforcement Learning through Value-Decomposition

Efficient Diversity-based Experience Replay for Deep Reinforcement Learning

Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning