A Survey on Mixture of Experts

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang

2024-08-08

Abstract: Large language models (LLMs) have achieved unprecedented advances across diverse fields, ranging from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by their substantial model size, extensive and diverse datasets, and the vast computational power harnessed during training, all of which contribute to emergent abilities (e.g., in-context learning) that are not present in small models. Within this context, the mixture of experts (MoE) has emerged as an effective method for substantially scaling up model capacity with minimal computation overhead, gaining significant attention from academia and industry. Despite its growing prevalence, a systematic and comprehensive review of the literature on MoE is still lacking. This survey seeks to bridge that gap, serving as an essential resource for researchers delving into the intricacies of MoE. We first briefly introduce the structure of the MoE layer and then propose a new taxonomy of MoE. Next, we give an overview of the core designs of various MoE models, covering both algorithmic and system aspects, alongside collections of available open-source implementations, hyperparameter configurations, and empirical evaluations. Furthermore, we delineate the multifaceted applications of MoE in practice and outline some potential directions for future research. To facilitate ongoing updates and the sharing of cutting-edge developments in MoE research, we have established a resource repository accessible at https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

The main aim of this paper is to address the computational resource limitations encountered by large language models (LLMs) when expanding model capacity, specifically through the Mixture of Experts (MoE) architecture. The specific objectives of the paper include:

1. **Systematic Review**: Provide a systematic and comprehensive literature review on the MoE architecture to fill the current research gap. This will help researchers better understand the latest advancements in the MoE field.
2. **New Taxonomy**: Propose a new taxonomy to organize and analyze the progress in MoE-related research. This taxonomy categorizes MoE from three perspectives: algorithm design, system design, and applications.
3. **Algorithm Design**: Delve into the design of MoE algorithms, including but not limited to:
   - Different types of **Gating Functions** and their design choices, such as sparse gating, dense gating, and soft gating.
   - The structure and configuration of **Expert Networks**.
   - Selection of **hyperparameters**, such as the number of experts, expert size, and activation frequency.
   - **Training and inference schemes**, such as Dense2Sparse and Sparse2Dense conversion strategies.
4. **System Design**: Discuss how to optimize MoE systems in terms of computation, communication, and storage to address the challenges posed by large-scale models.
5. **Practical Applications**: Outline practical application cases of MoE in natural language processing (NLP), computer vision (CV), recommendation systems (RecSys), and multimodal applications.
6. **Future Directions**: Present key challenges and opportunities for future research on MoE, and how to bridge the gap between research and practice.

In summary, the goal of this paper is to provide a comprehensive guiding framework for MoE research and development, and to promote further innovation in this field.
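
To make the difference between dense and sparse gating mentioned above more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the survey's or any library's implementation: the linear router, the softmax-over-top-k normalization, the toy scaling experts, and every name (`dense_gate`, `sparse_top_k_gate`, `moe_layer`, `router_weights`) are hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_gate(logits):
    """Dense gating: every expert receives a nonzero weight."""
    return softmax(logits)


def sparse_top_k_gate(logits, k):
    """Sparse gating: keep the k largest logits and renormalize over them only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    gates = [0.0] * len(logits)
    for i, w in zip(top, weights):
        gates[i] = w
    return gates


def moe_layer(x, experts, router_weights, k=None):
    """Combine expert outputs for a single token vector x.

    experts: list of callables mapping a vector to a vector of the same size.
    router_weights: one weight vector per expert (assumed linear router).
    k: None selects dense gating; an integer selects top-k sparse gating.
    """
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in router_weights]
    gates = dense_gate(logits) if k is None else sparse_top_k_gate(logits, k)
    out = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # experts with zero gate are never evaluated
        y = expert(x)
        out = [o + g * y_j for o, y_j in zip(out, y)]
    return out, gates


# Toy setup: four experts that simply scale their input (hypothetical).
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [0.0, 0.5], [0.3, 0.3], [-0.2, 0.1]]

sparse_out, sparse_gates = moe_layer([1.0, 2.0], experts, router_weights, k=2)
dense_out, dense_gates = moe_layer([1.0, 2.0], experts, router_weights)
```

In this sketch the sparse variant evaluates only two of the four experts per token, which is the sense in which an MoE layer can add parameters without a proportional increase in per-token computation; the dense variant evaluates all of them.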

A Closer Look into Mixture-of-Experts in Large Language Models

A Survey on Inference Optimization Techniques for Mixture of Experts Models

HMoE: Heterogeneous Mixture of Experts for Language Modeling

Mixture of Diverse Size Experts

OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer

Merging Experts into One: Improving Computational Efficiency of Mixture of Experts

OLMoE: Open Mixture-of-Experts Language Models

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

Toward Inference-optimal Mixture-of-Expert Large Language Models

MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach

Semi-Supervised Learning of Noisy Mixture of Experts Models

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

Learning Mixtures of Experts with EM