A Survey on Mixture of Experts

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang
2024-08-08
Abstract: Large language models (LLMs) have driven unprecedented advances across diverse fields, ranging from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by their substantial model size, extensive and diverse datasets, and the vast computational power harnessed during training, all of which contribute to the emergent abilities of LLMs (e.g., in-context learning) that are not present in small models. Within this context, the mixture of experts (MoE) has emerged as an effective method for substantially scaling up model capacity with minimal computation overhead, gaining significant attention from academia and industry. Despite its growing prevalence, a systematic and comprehensive review of the literature on MoE is still lacking. This survey seeks to bridge that gap, serving as an essential resource for researchers delving into the intricacies of MoE. We first briefly introduce the structure of the MoE layer and then propose a new taxonomy of MoE. Next, we give an overview of the core designs of various MoE models, covering both algorithmic and system aspects, alongside collections of available open-source implementations, hyperparameter configurations, and empirical evaluations. Furthermore, we delineate the multifaceted applications of MoE in practice and outline some potential directions for future research. To facilitate ongoing updates and the sharing of cutting-edge developments in MoE research, we have established a resource repository accessible at https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The main aim of this paper is to address the computational resource limitations encountered by large language models (LLMs) when expanding model capacity, specifically through the Mixture of Experts (MoE) architecture. The specific objectives of the paper include:

1. **Systematic Review**: Provide a systematic and comprehensive literature review on the MoE architecture to fill the current research gap. This will help researchers better understand the latest advancements in the MoE field.
2. **New Taxonomy**: Propose a new taxonomy to organize and analyze the progress in MoE-related research. This taxonomy will categorize MoE from three perspectives: algorithm design, system design, and applications.
3. **Algorithm Design**: Delve into the design of MoE algorithms, including but not limited to:
   - Different types of **Gating Functions** and their design choices, such as sparse gating, dense gating, and soft gating.
   - The structure and configuration of **Expert Networks**.
   - Selection of **hyperparameters**, such as the number of experts, size, activation frequency, etc.
   - **Training and inference schemes**, such as Dense2Sparse and Sparse2Dense conversion strategies.
4. **System Design**: Discuss how to optimize MoE systems in terms of computation, communication, and storage to address the challenges posed by large-scale models.
5. **Practical Applications**: Outline practical application cases of MoE in natural language processing (NLP), computer vision (CV), recommendation systems (RecSys), and multimodal applications.
6. **Future Directions**: Present key challenges and opportunities for future research on MoE, and how to bridge the gap between research and practice.

In summary, the goal of this paper is to provide a comprehensive guiding framework for MoE research and development, and to promote further innovation in this field.
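To make the difference between the sparse and dense gating choices listed under Algorithm Design concrete, the following is a minimal sketch in plain Python. It is not taken from the survey: the function names (`dense_gate`, `sparse_top_k_gate`, `moe_forward`, `make_expert`), the linear gate, the toy scaling "experts", and the choice of top-k with renormalization over the selected experts are all illustrative assumptions. Soft gating is not covered here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_gate(logits):
    """Dense gating: every expert receives a nonzero weight."""
    return softmax(logits)


def sparse_top_k_gate(logits, k):
    """Sparse gating (assumed top-k variant): keep the k largest logits
    and renormalize over those experts only; all others get weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    gate = [0.0] * len(logits)
    for i, w in zip(top, weights):
        gate[i] = w
    return gate


def make_expert(scale):
    """Hypothetical toy expert: scales its input. A real expert is
    typically a feed-forward network."""
    return lambda x: [scale * xi for xi in x]


def moe_forward(x, gate_weights, experts, k=None):
    """Weighted sum of expert outputs. Experts with zero gate weight
    are skipped, which is where the compute saving comes from."""
    # Linear gate: one logit per expert (dot product of a weight row with x).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gate = dense_gate(logits) if k is None else sparse_top_k_gate(logits, k)
    out = [0.0] * len(x)
    evaluated = 0
    for g, expert in zip(gate, experts):
        if g == 0.0:
            continue  # unselected expert: never executed
        evaluated += 1
        y = expert(x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gate, evaluated


random.seed(0)
dim, num_experts = 4, 8
gate_weights = [[random.uniform(-1, 1) for _ in range(dim)]
                for _ in range(num_experts)]
experts = [make_expert(s) for s in range(1, num_experts + 1)]
x = [0.5, -1.0, 2.0, 0.1]

_, sparse_weights, sparse_runs = moe_forward(x, gate_weights, experts, k=2)
_, dense_weights, dense_runs = moe_forward(x, gate_weights, experts)
print("experts run (sparse, k=2):", sparse_runs)
print("experts run (dense):", dense_runs)
```

In this sketch the model holds all eight experts' parameters in both cases, but the sparse gate runs only two of them per input, which is the sense in which MoE adds capacity without a proportional increase in computation.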