Abstract: The traditional viewpoint on Sparse Mixture of Experts (MoE) models is that instead of training a single large expert, which is computationally expensive, we can train many small experts. The hope is that if the total parameter count of the small experts equals that of the singular large expert, then we retain the representation power of the large expert while gaining computational tractability and promoting expert specialization. The recently introduced Soft MoE replaces the Sparse MoE's discrete routing mechanism with a differentiable gating function that smoothly mixes tokens. While this smooth gating function successfully mitigates the various training instabilities associated with Sparse MoE, it is unclear whether it induces implicit biases that affect Soft MoE's representation power or potential for expert specialization. We prove that Soft MoE with a single arbitrarily powerful expert cannot represent simple convex functions. This justifies that Soft MoE's success cannot be explained by the traditional viewpoint of many small experts collectively mimicking the representation power of a single large expert, and that multiple experts are actually necessary to achieve good representation power (even for a fixed total parameter count). Continuing along this line of investigation, we introduce a notion of expert specialization for Soft MoE, and while varying the number of experts yet fixing the total parameter count, we consider the following (computationally intractable) task. Given any input, how can we discover the expert subset that is specialized to predict this input's label? We empirically show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset. Our method can be easily implemented to potentially reduce computation during inference.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper examines implicit biases of the Soft Mixture of Experts (Soft MoE) architecture that parameter count alone does not capture, and attempts to answer the following two core questions:

1. **Can Soft MoE with a single expert represent simple functions?**
   - The traditional view holds that splitting a large expert into multiple smaller experts (keeping the total number of parameters constant) retains the expressiveness of the large expert while improving computational efficiency. For Soft MoE, it is unclear whether this view holds. The paper proves that Soft MoE with a single expert cannot represent simple convex functions, no matter how powerful that expert is. Multiple experts are therefore necessary for good representation power, even when the total number of parameters is fixed.

2. **Does Soft MoE exhibit expert specialization?**
   - Because Soft MoE mixes the input tokens before passing them to the experts, it is not obvious what expert specialization means for this architecture. The paper introduces a notion of expert specialization for Soft MoE and considers the (computationally intractable) task of finding, for a given input, the subset of experts specialized to predict that input's label. Experiments show that when there are many small experts (with the total number of parameters fixed), the architecture is implicitly biased in a way that allows this specialized subset to be approximated efficiently.

### Summary

- **Theoretical Contribution**: The paper proves that Soft MoE with a single expert cannot represent simple convex functions. This challenges the traditional view and indicates that multiple experts are necessary for representation power, even at a fixed total parameter count.
- **Experimental Contribution**: The paper proposes a method for discovering specialized expert subsets and shows empirically that it works when there are many small experts. This offers a potential way to reduce computation during inference.
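The smooth token mixing that distinguishes Soft MoE from Sparse MoE can be illustrated with a small NumPy sketch. This is not the paper's code. It assumes the usual Soft MoE formulation with one slot per expert: dispatch weights are a softmax over tokens, combine weights are a softmax over slots. The names `soft_moe`, `Phi`, and `experts` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def soft_moe(X, Phi, experts):
    """Sketch of a Soft MoE layer (assumption: one slot per expert).

    X:       (m, d) array of m input tokens.
    Phi:     (d, n) array of slot parameters, one column per expert.
    experts: list of n callables, each mapping a (d,) slot to a (d,) output.
    """
    logits = X @ Phi                 # (m, n) token-slot affinities
    D = softmax(logits, axis=0)      # dispatch: each column sums to 1 over tokens
    C = softmax(logits, axis=1)      # combine: each row sums to 1 over slots
    slots = D.T @ X                  # (n, d) each slot is a convex mix of tokens
    outs = np.stack([f(s) for f, s in zip(experts, slots)])  # (n, d)
    return C @ outs                  # (m, d) each token mixes the expert outputs


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Phi = rng.normal(size=(4, 3))
# Toy experts: scaled tanh, purely for illustration.
experts = [lambda s, k=k: (k + 1) * np.tanh(s) for k in range(3)]
Y = soft_moe(X, Phi, experts)        # shape (5, 4)
```

In this sketch, when there is only one expert the combine weights are all 1, so every output token equals that expert's output on the single mixed slot, regardless of how expressive the expert is. This is only an observation about the sketch, not a restatement of the paper's proof.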

Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts
