Diversity Progress for Goal Selection in Discriminability-Motivated RL

Erik M. Lintunen, Nadia M. Ady, Christian Guckelsberger
2024-11-06
Abstract: Non-uniform goal selection has the potential to improve the reinforcement learning (RL) of skills over uniform-random selection. In this paper, we introduce a method for learning a goal-selection policy in intrinsically-motivated goal-conditioned RL: "Diversity Progress" (DP). The learner forms a curriculum based on observed improvement in discriminability over its set of goals. Our proposed method is applicable to the class of discriminability-motivated agents, where the intrinsic reward is computed as a function of the agent's certainty of following the true goal being pursued. This reward can motivate the agent to learn a set of diverse skills without extrinsic rewards. We demonstrate empirically that a DP-motivated agent can learn a set of distinguishable skills faster than previous approaches, and do so without suffering from a collapse of the goal distribution -- a known issue with some prior approaches. We end with plans to take this proof-of-concept forward.
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to select goals in discriminability-motivated reinforcement learning (RL) so that multiple distinct skills are learned more efficiently and effectively. The paper proposes a method for learning a goal-selection policy, called "Diversity Progress" (DP), which accelerates the learning of diverse skills by preferentially selecting the goals that show the greatest improvement in discriminability. Compared with uniform-random goal selection, DP avoids collapse of the goal distribution and learns a set of distinguishable skills in less time.

### Core contributions of the paper:

1. **Proposing the Diversity Progress (DP) method**: DP forms a curriculum based on the observed improvement in discriminability over the goal set. It applies to discriminability-motivated agents, whose intrinsic reward is computed as a function of the agent's certainty of following the true goal being pursued.
2. **Empirical research**: The experimental results show that DP-motivated agents can learn a set of distinguishable skills faster than previous methods, without suffering from goal-distribution collapse.
3. **Future work plans**: The authors outline plans to develop this proof-of-concept further, including testing other intrinsic rewards and evaluating performance in different environments.

### Specific problems solved:

- **Goal-distribution collapse**: Some previous methods (such as VIC) gradually concentrate on only a few skills during training, which reduces the number of effective skills. DP avoids this collapse by dynamically adjusting the goal-selection probabilities.
- **Low learning efficiency**: Traditional methods usually select goals uniformly at random, which may lead to low learning efficiency. DP improves efficiency by preferentially selecting the goals that yield the most discriminability progress.
- **Lack of diverse skills**: In multi-skill learning, ensuring that the learned skills are diverse is a key challenge. DP promotes more diverse skill learning by maximizing the discriminability between goals.

### Formula summary:

- **Discriminability objective function**:
  \[ I(g; f(T_{\pi_g})) := H(g) - H(g \mid f(T_{\pi_g})) \]
  where \(H\) denotes Shannon entropy.
- **Variational lower bound**:
  \[ I(g; f(T_{\pi_g})) \geq \tilde{I}(g; f(T_{\pi_g})) := H(g) + \mathbb{E}_{g \sim p(g),\, T_{\pi_g} \sim \pi(g)} \left[ \log q(g \mid f(T_{\pi_g})) \right] \]
  where \(q\) is a variational approximation of the posterior over goals. The sign follows from \(H(g \mid f(T_{\pi_g})) = -\mathbb{E}\left[ \log p(g \mid f(T_{\pi_g})) \right]\).
- **Learning progress (LP)**:
  \[ LP_n(t + 1) := e_n(t + 1 - \tau) - e_n(t + 1) \]
  that is, the decrease in the error \(e_n\) over a window of \(\tau\) steps.
- **Diversity progress (DP)**:
  \[ DP(t + 1) := \frac{1}{|G|} \sum_{g \in G} \left( e_g(t + 1 - \tau) - e_g(t + 1) \right) \]
  that is, the same decrease in error, averaged over all goals \(g\) in the goal set \(G\).

Through these formulas and methods, the paper demonstrates the potential of DP in promoting multi-skill learning and provides directions for future improvements.
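To make the LP and DP formulas above concrete, here is a minimal Python sketch. It is not the authors' implementation. The per-goal error histories are made-up numbers (one could think of each entry as a discriminator's error for that goal, which is an assumption), and the function names, the `temperature` and `eps` parameters, and the softmax rule that maps per-goal progress to goal-selection probabilities are all hypothetical. The summary says only that DP adjusts goal-selection probabilities dynamically.

```python
import math


def progress(errors, tau):
    """Per-goal progress e_g(t+1-tau) - e_g(t+1); `errors` is ordered oldest to newest."""
    if len(errors) <= tau:
        return 0.0  # assumption: too little history counts as no progress
    return errors[-1 - tau] - errors[-1]


def diversity_progress(histories, tau):
    """Mean per-goal progress over the goal set G, following the DP formula."""
    values = [progress(h, tau) for h in histories.values()]
    return sum(values) / len(values)


def selection_probs(histories, tau, temperature=1.0, eps=0.1):
    """Hypothetical selection rule: softmax over per-goal progress,
    mixed with a uniform distribution so no goal's probability drops below eps/|G|."""
    goals = list(histories)
    scores = [progress(histories[g], tau) / temperature for g in goals]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    n = len(goals)
    return {g: (1 - eps) * e / total + eps / n for g, e in zip(goals, exps)}


# Illustrative (made-up) error histories for three goals.
histories = {
    "g0": [1.0, 0.9, 0.6, 0.4],  # improving quickly
    "g1": [1.0, 1.0, 1.0, 1.0],  # no improvement
    "g2": [1.0, 0.8, 0.8, 0.7],  # improving slowly
}
tau = 2
dp = diversity_progress(histories, tau)
probs = selection_probs(histories, tau)
```

With these histories and `tau = 2`, the per-goal progress values are about 0.5, 0.0, and 0.1, so `dp` is about 0.2 and `g0` gets the largest selection probability. The uniform mixing term is one simple way to keep every goal selectable, in the spirit of avoiding goal-distribution collapse. Whether the paper uses such a term is not stated in this summary.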