Abstract: Reinforcement learning (RL) is a powerful approach for acquiring a well-performing policy. However, learning diverse skills is challenging in RL due to the commonly used Gaussian policy parameterization. We propose \textbf{Di}verse \textbf{Skil}l \textbf{L}earning (Di-SkilL\footnote{Videos and code are available on the project webpage: \url{https://alrhub.github.io/di-skill-website/}}), an RL method for learning diverse skills using Mixture of Experts, where each expert formalizes a skill as a contextual motion primitive. Di-SkilL optimizes each expert and its associated context distribution to a maximum entropy objective that incentivizes learning diverse skills in similar contexts. The per-expert context distribution enables automatic curriculum learning, allowing each expert to focus on its best-performing sub-region of the context space. To overcome hard discontinuities and multi-modalities without any prior knowledge of the environment's unknown context probability space, we leverage energy-based models to represent the per-expert context distributions and demonstrate how we can efficiently train them using the standard policy gradient objective. We show on challenging robot simulation tasks that Di-SkilL can learn diverse and performant skills.

What problem does this paper attempt to address?

The main problem this paper addresses is **how to effectively learn diverse skills in Reinforcement Learning (RL)**. The paper proposes a new method, **Di-SkilL (Diverse Skill Learning)**, to overcome the challenges that traditional RL methods face when learning diverse skills.

### Problem Background

Traditional RL methods usually use a Gaussian policy parameterization, which makes it difficult to learn diverse skills. A Gaussian policy can capture only a single behavior mode and cannot represent multi-modal behaviors. As a result, these methods often struggle to adapt to different situations in complex tasks, which limits their performance.

### Solutions Proposed in the Paper

To address this challenge, the paper proposes **Di-SkilL**, an RL method based on a Mixture of Experts (MoE) and automatic curriculum learning. Specifically:

1. **Mixture of Experts (MoE)**:
   - Each expert is designed as a contextual motion primitive that captures a skill for specific contexts.
   - Each expert and its corresponding context distribution are optimized under a maximum entropy objective, which encourages learning diverse skills in similar contexts.
2. **Automatic Curriculum Learning**:
   - A per-expert context distribution is introduced, allowing each expert to focus on the sub-region of the context space where it performs best.
   - Through this mechanism, each expert selects the contexts it prefers to train on, which improves learning efficiency.
3. **Energy-Based Models (EBM)**:
   - Energy-based models represent the context distribution of each expert, which handles the hard discontinuities and multi-modalities that are difficult for traditional methods.
   - The EBMs are trained with the standard policy gradient objective and do not require prior knowledge of the environment's unknown context distribution.
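The sampling scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-expert context distribution is approximated by normalizing `exp(energy)` over a finite batch of contexts drawn from the environment, each expert is a Gaussian over motion primitive parameters, and all names (`expert_context_probs`, `sample_skill`, `energy_fns`, `mean_fns`) as well as the toy energy and mean functions are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def expert_context_probs(energy_fn, contexts):
    """Normalize exp(energy) over a finite batch of contexts (assumed EBM form)."""
    energies = np.array([energy_fn(c) for c in contexts])
    return softmax(energies)

def sample_skill(rng, contexts, energy_fn, mean_fn, std):
    """Sample a context the expert prefers, then sample skill parameters for it."""
    probs = expert_context_probs(energy_fn, contexts)
    idx = rng.choice(len(contexts), p=probs)
    c = contexts[idx]
    theta = rng.normal(mean_fn(c), std)  # Gaussian expert over parameters
    return c, theta

rng = np.random.default_rng(0)
# Stand-in for contexts sampled from the environment (hypothetical 2-D contexts).
contexts = rng.uniform(-1.0, 1.0, size=(256, 2))

# Hypothetical energies: expert 0 prefers c[0] < 0, expert 1 prefers c[0] > 0.
energy_fns = [lambda c: -5.0 * c[0], lambda c: 5.0 * c[0]]
# Hypothetical context-dependent means of the skill parameters (3-D here).
mean_fns = [
    lambda c: np.array([c[0], c[1], 0.0]),
    lambda c: np.array([-c[0], c[1], 1.0]),
]

for o in range(2):
    c, theta = sample_skill(rng, contexts, energy_fns[o], mean_fns[o], std=0.1)
    print(f"expert {o}: context={c}, theta={theta}")
```

Normalizing over a sampled batch means the experts only ever place probability on contexts the environment actually produced, which is one way to avoid assuming anything about the shape of the context space.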
### Main Contributions

- Proposes Di-SkilL, an RL method capable of learning diverse, non-linear skills in a continuous context space.
- Introduces energy-based models and an automatic curriculum learning mechanism to handle the multi-modalities and hard discontinuities that are difficult for traditional methods.
- Demonstrates the effectiveness of Di-SkilL on multiple challenging robot simulation tasks, showing that it learns diverse and performant skills.

### Formula Summary

The key formulas involved in the paper include:

- **Maximum Entropy Objective Function**:

  \[ \max_{\pi(\theta|c), \pi(c)} \mathbb{E}_{\pi(c)} \left[ \mathbb{E}_{\pi(\theta|c)}[R(c, \theta)] + \alpha H[\pi(\theta|c)] \right] - \beta KL(\pi(c) \| p(c)) \]

  where \( R(c, \theta) \) is the reward obtained by executing the skill parameters \( \theta \) in context \( c \), \( H[\pi(\theta|c)] \) is the entropy of the policy, and \( KL(\pi(c) \| p(c)) \) is the KL divergence between the policy's context distribution \( \pi(c) \) and the environment's context distribution \( p(c) \).

- **Expert Update Objective Function**:

  \[ \max_{\pi(\theta|c,o)} \mathbb{E}_{\pi(c|o), \pi(\theta|c,o)}[R(c, \theta) + \alpha \log \tilde{\pi}(o|c, \theta)] + \alpha \mathbb{E}_{\pi(c|o)}[H[\pi(\theta|c
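As a rough numerical illustration of the maximum entropy objective listed in the formula summary, the sketch below estimates it by Monte Carlo for a toy setting. The setup is an assumption made for illustration only: contexts form a finite batch, \( \pi(c) \) and \( p(c) \) are probability vectors over that batch, the policy \( \pi(\theta|c) \) is a diagonal Gaussian with a context-independent standard deviation, and `reward_fn`, `mean_fn`, and all function names are hypothetical.

```python
import numpy as np

def gaussian_entropy(std):
    """Closed-form entropy of a diagonal Gaussian with standard deviations std."""
    std = np.asarray(std, dtype=float)
    return float(0.5 * std.size * np.log(2.0 * np.pi * np.e) + np.sum(np.log(std)))

def kl_discrete(q, p):
    """KL(q || p) for two probability vectors on the same finite support."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def max_entropy_objective(rng, contexts, pi_c, p_c, mean_fn, std,
                          reward_fn, alpha, beta, n_theta=64):
    """Estimate E_pi(c)[E_pi(theta|c)[R] + alpha * H] - beta * KL(pi(c) || p(c))."""
    std = np.asarray(std, dtype=float)
    entropy = gaussian_entropy(std)  # identical for every context in this toy setup
    inner = np.empty(len(contexts))
    for i, c in enumerate(contexts):
        # Sample skill parameters from the Gaussian policy for this context.
        thetas = rng.normal(mean_fn(c), std, size=(n_theta, std.size))
        expected_reward = np.mean([reward_fn(c, t) for t in thetas])
        inner[i] = expected_reward + alpha * entropy
    return float(np.dot(pi_c, inner) - beta * kl_discrete(pi_c, p_c))

rng = np.random.default_rng(0)
contexts = rng.uniform(-1.0, 1.0, size=(32, 2))
p_c = np.full(len(contexts), 1.0 / len(contexts))  # uniform stand-in for p(c)
pi_c = p_c.copy()                                  # policy context distribution

mean_fn = lambda c: c                                  # hypothetical policy mean
reward_fn = lambda c, theta: -np.sum((theta - c) ** 2)  # hypothetical reward

value = max_entropy_objective(rng, contexts, pi_c, p_c, mean_fn,
                              std=[0.1, 0.1], reward_fn=reward_fn,
                              alpha=0.01, beta=1.0)
print("objective estimate:", value)
```

When `pi_c` equals `p_c` the KL term vanishes, so changing `beta` has no effect on the estimate; the term only matters once the context distribution concentrates on a sub-region.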

Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts
