Abstract: Reinforcement learning (RL) is a powerful approach for acquiring a good-performing policy. However, learning diverse skills is challenging in RL due to the commonly used Gaussian policy parameterization. We propose \textbf{Di}verse \textbf{Skil}l \textbf{L}earning (Di-SkilL\footnote{Videos and code are available on the project webpage: \url{https://alrhub.github.io/di-skill-website/}}), an RL method for learning diverse skills using Mixture of Experts, where each expert formalizes a skill as a contextual motion primitive. Di-SkilL optimizes each expert and its associated context distribution to a maximum entropy objective that incentivizes learning diverse skills in similar contexts. The per-expert context distribution enables automatic curriculum learning, allowing each expert to focus on its best-performing sub-region of the context space. To overcome hard discontinuities and multi-modalities without any prior knowledge of the environment's unknown context probability space, we leverage energy-based models to represent the per-expert context distributions and demonstrate how we can efficiently train them using the standard policy gradient objective. We show on challenging robot simulation tasks that Di-SkilL can learn diverse and performant skills.
What problem does this paper attempt to address?
The main problem this paper addresses is **how to learn diverse skills effectively in reinforcement learning (RL)**. It proposes a new method, **Di-SkilL (Diverse Skill Learning)**, to overcome the difficulties that traditional RL methods face when learning diverse skills.
### Problem Background
Traditional RL methods usually parameterize the policy as a Gaussian, which makes it difficult to learn diverse skills. A Gaussian policy is unimodal, so it captures a single behavior pattern and cannot represent multi-modal behaviors. As a result, these methods often struggle to adapt to different situations in complex tasks, which limits their performance.
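The limitation of a unimodal policy can be illustrated with a small, hypothetical example that is not taken from the paper. Assume a task has two equally good solutions, one near action -1 and one near action +1. Fitting a single Gaussian to samples of both solutions places its mean between the two modes, in a region where almost no good action lies. The function name, mode locations, and noise level below are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_bimodal_actions(n, seed=0):
    """Draw actions from two narrow modes at -1 and +1 (a toy stand-in for two skills)."""
    rng = random.Random(seed)
    return [rng.gauss(rng.choice([-1.0, 1.0]), 0.1) for _ in range(n)]

actions = sample_bimodal_actions(2000)

# Maximum-likelihood fit of a single Gaussian: sample mean and standard deviation.
mean = statistics.fmean(actions)
std = statistics.pstdev(actions)

# Fraction of the observed actions that lie close to the fitted mean.
near_mean = sum(abs(a - mean) < 0.3 for a in actions) / len(actions)

print(f"fitted mean={mean:.3f}, std={std:.3f}, share of data near mean={near_mean:.3f}")
```

The fitted Gaussian is wide and centered between the modes, so its most likely action matches neither solution. A mixture with one component per mode avoids this, which is the motivation for using several experts.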
### Solutions Proposed in the Paper
To address this challenge, the paper proposes **Di-SkilL**, an RL method based on the Mixture of Experts (MoE) and automatic curriculum learning. Specifically:
1. **Mixture of Experts (MoE)**:
- Each expert is designed as a contextual motion primitive that can capture skills in specific situations.
- Each expert and its corresponding context distribution are optimized with a maximum entropy objective, which encourages the experts to learn diverse skills in similar contexts.
2. **Automatic Curriculum Learning**:
- A per-expert context distribution is introduced, allowing each expert to focus on the sub-region of the context space where it performs best.
- This automatic curriculum learning mechanism lets each expert select its training samples according to its own preference, thereby improving learning efficiency.
3. **Energy-Based Models (EBM)**:
- Energy-based models represent the context distribution of each expert, which handles the hard discontinuities and multi-modality that traditional methods struggle with.
- The EBMs are trained with the standard policy gradient objective, so no prior knowledge of the environment's unknown context distribution is required.
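The idea of a per-expert context distribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the context is one-dimensional, the two score functions are hand-written stand-ins for learned energy-based models, and each expert's distribution is obtained by normalizing exponentiated scores over a finite batch of contexts sampled from the environment. The names `expert_context_probs`, `scores`, and `contexts` are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expert_context_probs(score_fn, contexts):
    """Normalize exp(score) over a finite batch of environment contexts."""
    return softmax([score_fn(c) for c in contexts])

rng = random.Random(0)

# Stand-in for contexts sampled from the environment's (unknown) distribution p(c).
contexts = [rng.uniform(-1.0, 1.0) for _ in range(256)]

# Hypothetical scores: expert 0 prefers contexts near -0.5, expert 1 near +0.5.
scores = [
    lambda c: -8.0 * (c + 0.5) ** 2,
    lambda c: -8.0 * (c - 0.5) ** 2,
]

probs = [expert_context_probs(f, contexts) for f in scores]

# Each expert draws its own training contexts from the batch (its curriculum).
chosen = [rng.choices(contexts, weights=p, k=200) for p in probs]
mean_context = [sum(cs) / len(cs) for cs in chosen]

print("mean context per expert:", [round(m, 3) for m in mean_context])
```

Because each distribution is normalized only over contexts that the environment actually produced, an expert never assigns probability to contexts outside the valid context space. In this sketch that is what removes the need for prior knowledge about the shape of that space.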
### Main Contributions
- Proposes Di-SkilL, an RL method capable of learning diverse, non-linear skills in a continuous context space.
- Introduces energy-based models and an automatic curriculum learning mechanism to handle the multi-modality and hard discontinuities that traditional methods struggle with.
- Demonstrates on several challenging robot simulation tasks that Di-SkilL learns diverse and performant skills.
### Formula Summary
The key formulas involved in the paper include:
- **Maximum Entropy Objective Function**:
\[
\max_{\pi(\theta|c), \pi(c)} \mathbb{E}_{\pi(c)} \left[ \mathbb{E}_{\pi(\theta|c)}[R(c, \theta)] + \alpha H[\pi(\theta|c)] \right] - \beta KL(\pi(c) \| p(c))
\]
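where \( R(c, \theta) \) is the reward obtained by executing the motion-primitive parameters \( \theta \) in context \( c \), \( H[\pi(\theta|c)] \) is the entropy of the policy, \( KL(\pi(c) \| p(c)) \) is the KL divergence between the learned context distribution \( \pi(c) \) and the environment's context distribution \( p(c) \), and \( \alpha \) and \( \beta \) weight the entropy bonus and the KL penalty, respectively.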
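To make the terms of the maximum entropy objective concrete, the following sketch evaluates it exactly for a toy problem. It assumes finitely many contexts and parameter choices so that all expectations become sums, whereas the paper works in continuous spaces. The function names and the numbers in the usage example are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy H[p] in nats, with 0 * log(0) treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions, with 0 * log(0) treated as 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def max_entropy_objective(pi_c, pi_theta_given_c, reward, p_c, alpha, beta):
    """E_{pi(c)}[ E_{pi(theta|c)}[R(c, theta)] + alpha * H[pi(theta|c)] ] - beta * KL(pi(c) || p(c))."""
    inner = 0.0
    for c, weight in enumerate(pi_c):
        policy = pi_theta_given_c[c]
        expected_reward = sum(pt * reward[c][t] for t, pt in enumerate(policy))
        inner += weight * (expected_reward + alpha * entropy(policy))
    return inner - beta * kl_divergence(pi_c, p_c)

# Toy setup: two contexts, two parameter choices, constant reward of 1.
p_c = [0.5, 0.5]
reward = [[1.0, 1.0], [1.0, 1.0]]
uniform_policy = [[0.5, 0.5], [0.5, 0.5]]

# Context distribution that matches p(c): the KL penalty vanishes.
j_matched = max_entropy_objective([0.5, 0.5], uniform_policy, reward, p_c, alpha=0.5, beta=2.0)
# Context distribution that collapses onto one context: penalized by beta * log(2).
j_narrow = max_entropy_objective([1.0, 0.0], uniform_policy, reward, p_c, alpha=0.5, beta=2.0)

print(j_matched, j_narrow)
```

In this toy case the reward is constant, so narrowing the context distribution only incurs the KL penalty. When rewards differ across contexts, \( \beta \) trades off specializing on high-reward contexts against covering the environment's context distribution.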
- **Expert Update Objective Function**:
\[
\max_{\pi(\theta|c,o)} \mathbb{E}_{\pi(c|o), \pi(\theta|c,o)}[R(c, \theta) + \alpha \log \tilde{\pi}(o|c, \theta)] + \alpha \mathbb{E}_{\pi(c|o)}[H[\pi(\theta|c