Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts

Onur Celik, Aleksandar Taranovic, Gerhard Neumann
2024-06-10
Abstract: Reinforcement learning (RL) is a powerful approach for acquiring a good-performing policy. However, learning diverse skills is challenging in RL due to the commonly used Gaussian policy parameterization. We propose Diverse Skill Learning (Di-SkilL; videos and code are available on the project webpage: https://alrhub.github.io/di-skill-website/), an RL method for learning diverse skills using Mixture of Experts, where each expert formalizes a skill as a contextual motion primitive. Di-SkilL optimizes each expert and its associated context distribution to a maximum entropy objective that incentivizes learning diverse skills in similar contexts. The per-expert context distribution enables automatic curricula learning, allowing each expert to focus on its best-performing sub-region of the context space. To overcome hard discontinuities and multi-modalities without any prior knowledge of the environment's unknown context probability space, we leverage energy-based models to represent the per-expert context distributions and demonstrate how we can efficiently train them using the standard policy gradient objective. We show on challenging robot simulation tasks that Di-SkilL can learn diverse and performant skills.
Machine Learning, Robotics
What problem does this paper attempt to address?
The main problem this paper addresses is **how to effectively learn diverse skills in reinforcement learning (RL)**. It proposes a new method, **Di-SkilL (Diverse Skill Learning)**, to overcome the difficulties that standard RL methods face when learning diverse skills.

### Problem Background

Standard RL methods usually use a Gaussian policy parameterization, which makes it difficult to learn diverse skills. A Gaussian policy captures a single behavior pattern and cannot effectively represent multi-modal behaviors. On complex tasks where several distinct behaviors can solve the same situation, such policies adapt poorly, which limits performance.

### Solutions Proposed in the Paper

To address this challenge, the paper proposes **Di-SkilL**, an RL method based on a Mixture of Experts (MoE) and automatic curriculum learning. Specifically:

1. **Mixture of Experts (MoE)**:
   - Each expert is a contextual motion primitive that captures a skill for specific contexts.
   - Each expert is optimized together with its own context distribution under a maximum entropy objective, which encourages learning diverse skills in similar contexts.
2. **Automatic Curriculum Learning**:
   - A per-expert context distribution lets each expert focus on the sub-region of the context space where it performs best.
   - Each expert thereby chooses the contexts it trains on according to its own preference, which improves learning efficiency.
3. **Energy-Based Models (EBM)**:
   - Energy-based models represent the context distribution of each expert, which handles the hard discontinuities and multi-modalities that simpler parameterizations struggle with.
   - The EBMs can be trained approximately using the environment's own context distribution, with the standard policy gradient objective, so no prior knowledge of the environment is required.
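The interplay between the experts and their per-expert context distributions can be illustrated with a small sketch. This is not the authors' implementation: it assumes a one-dimensional context, a scalar motion primitive parameter, a uniform environment context distribution, and hand-written energy and mean functions in place of learned networks. The names `context_probs` and `sample_context_and_params` are hypothetical. The energy-based context distribution is normalized only over a batch of contexts drawn from the environment, which is one simple way to avoid needing a closed form for the context space.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_probs(energy, contexts):
    """Energy-based context distribution pi(c|o), proportional to exp(energy(c)),
    normalized over a batch of contexts sampled from the environment."""
    return softmax([energy(c) for c in contexts])


def sample_context_and_params(energy, mean_fn, std, contexts, rng):
    """Let one expert pick a context it prefers, then draw motion primitive
    parameters theta from its Gaussian pi(theta|c, o)."""
    probs = context_probs(energy, contexts)
    c = rng.choices(contexts, weights=probs, k=1)[0]
    theta = rng.gauss(mean_fn(c), std)
    return c, theta


rng = random.Random(0)

# Assumption: the environment draws contexts uniformly from [-1, 1].
contexts = [rng.uniform(-1.0, 1.0) for _ in range(200)]

# Two hypothetical experts. In practice the energy and mean would be learned;
# here they are fixed functions chosen only for illustration.
experts = [
    {"energy": lambda c: -5.0 * c, "mean": lambda c: 2.0 * c, "std": 0.1},
    {"energy": lambda c: 5.0 * c, "mean": lambda c: 1.0 - c, "std": 0.1},
]

# Expert 0 concentrates its probability mass on negative contexts.
probs_left = context_probs(experts[0]["energy"], contexts)
mass_left = sum(p for p, c in zip(probs_left, contexts) if c < 0.0)

c, theta = sample_context_and_params(
    experts[0]["energy"], experts[0]["mean"], experts[0]["std"], contexts, rng
)
```

Because each expert weights the sampled contexts by its own energy, two experts can specialize on different sub-regions of the context space, or overlap and offer different parameters for the same context, which is the multi-modal behavior a single Gaussian policy cannot express.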
### Main Contributions

- Proposes Di-SkilL, an RL method capable of learning diverse, non-linear skills in a continuous context space.
- Introduces energy-based models and an automatic curriculum learning mechanism to handle multi-modality and hard discontinuities in the context distribution.
- Demonstrates on multiple challenging robot simulation tasks that Di-SkilL can learn diverse and performant skills.

### Formula Summary

The key formulas in the paper include:

- **Maximum Entropy Objective Function**:
  \[ \max_{\pi(\theta|c), \pi(c)} \mathbb{E}_{\pi(c)} \left[ \mathbb{E}_{\pi(\theta|c)}[R(c, \theta)] + \alpha H[\pi(\theta|c)] \right] - \beta KL(\pi(c) \| p(c)) \]
  where \( R(c, \theta) \) is the reward obtained when executing the parameters \( \theta \) in context \( c \), \( H[\pi(\theta|c)] \) is the entropy of the policy, \( KL(\pi(c) \| p(c)) \) is the KL divergence between the learned context distribution \( \pi(c) \) and the environment's context distribution \( p(c) \), and \( \alpha \) and \( \beta \) weight the entropy and KL terms.
- **Expert Update Objective Function**:
  \[ \max_{\pi(\theta|c,o)} \mathbb{E}_{\pi(c|o), \pi(\theta|c,o)}[R(c, \theta) + \alpha \log \tilde{\pi}(o|c, \theta)] + \alpha \mathbb{E}_{\pi(c|o)}[H[\pi(\theta|c