Abstract: We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipping a shared base LLM with distinct domain-specific capabilities, activated via self-optimized routing. This allows for dynamic and capability-specific handling of various target tasks, enhancing overall capabilities, without extensive human-labeled data and added parameters. Our empirical results reveal that specializing LLMs may exhibit potential trade-offs in performances on non-specialized tasks. On the other hand, our Self-MoE demonstrates substantial improvements (6.5%p on average) over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design with semantic experts and routing. Our findings highlight the critical role of modularity, the applicability of Self-MoE to multiple base LLMs, and the potential of self-improvement in achieving efficient, scalable, and adaptable systems.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

LLMs are typically designed as monolithic architectures that need large amounts of annotated data to improve in specific domains, which limits their adaptability and scalability. When adapted to specific tasks, these monolithic models also tend to forget previously learned information, are inefficient, and lack transparency.

The paper therefore asks: how can we construct a compositional LLM that has multiple areas of expertise, under resource constraints and without significantly increasing the parameter count?

### Solution

To address this problem, the paper proposes **Self-MoE** (Self-Mixture of Experts), a framework that transforms a monolithic model into a compositional system called **MiXSE** (MiXture of Self-specialized Experts). The main features of Self-MoE are:

1. **Self-specialization**: Lightweight expert modules are created from a base LLM using self-generated synthetic data. Each expert module has specialized capabilities in a specific domain and is integrated with the shared base LLM through a self-optimized routing mechanism.
2. **Dynamic routing**: A router module dynamically selects the most appropriate expert module for the input task, so that different tasks are handled efficiently.
3. **Lightweight and adaptive**: The system achieves modularity and adaptability without relying on large amounts of human-annotated data and without significantly increasing parameters, which improves the model's overall capability and interpretability.

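The routing idea in the list above can be illustrated with a small sketch. This is not the paper's implementation: it assumes each "lightweight expert module" is a low-rank adapter added to one frozen linear layer, that the router is a single linear map followed by a softmax, and that only the top-scoring expert(s) are kept. The class name `MixtureOfAdapters`, the parameters `rank` and `top_k`, and the random weights are all hypothetical; only the four domain names come from the abstract.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfAdapters:
    """One frozen base layer plus routed low-rank experts (illustrative only)."""

    def __init__(self, d_in, d_out, expert_names, rank=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.names = list(expert_names)
        self.top_k = top_k
        # Shared base weights: frozen, reused by every expert.
        self.base = rand_matrix(d_out, d_in, rng)
        # Assumption: each expert is a low-rank pair (down, up).
        self.down = [rand_matrix(rank, d_in, rng) for _ in self.names]
        self.up = [rand_matrix(d_out, rank, rng) for _ in self.names]
        # Assumption: the router is one linear map, one logit per expert.
        self.router = rand_matrix(len(self.names), d_in, rng)

    def route(self, x):
        """Return {expert_index: weight} for the top_k experts, renormalized."""
        probs = softmax(matvec(self.router, x))
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        chosen = order[: self.top_k]
        total = sum(probs[i] for i in chosen)
        return {i: probs[i] / total for i in chosen}

    def forward(self, x):
        """Base output plus the weighted outputs of the selected experts."""
        y = matvec(self.base, x)
        weights = self.route(x)
        for i, w in weights.items():
            delta = matvec(self.up[i], matvec(self.down[i], x))
            y = [a + w * b for a, b in zip(y, delta)]
        return y, {self.names[i]: w for i, w in weights.items()}


layer = MixtureOfAdapters(
    d_in=4, d_out=3, expert_names=["knowledge", "reasoning", "math", "coding"]
)
y, used = layer.forward([1.0, 0.5, -0.5, 2.0])
print(used)  # which expert handled this input, and with what weight
```

Because the returned dictionary names the selected expert, the routing decision can be inspected per input, which is one plausible reading of the interpretability the summary attributes to semantic experts. In this sketch the base weights are never modified; only the small adapter and router matrices would be trained.
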
### Experimental Results

Through extensive experiments on multiple benchmarks, the paper validates the effectiveness of Self-MoE:

- **Performance improvement**: Self-MoE improves on the base LLM by 6.5 percentage points on average across benchmarks spanning knowledge, reasoning, math, and coding.
- **Outperforming other methods**: Self-MoE consistently outperforms alternatives such as instance merging and weight merging, while offering better flexibility and interpretability through semantic experts and routing.
- **Generalization ability**: Self-MoE also outperforms the base LLM in non-target domains, demonstrating good generalization ability.
- **Applicability**: Self-MoE can be applied to different families and sizes of LLMs, which supports its generality.

### Conclusion

Through the Self-MoE framework, the paper demonstrates how to construct a compositional LLM with multiple areas of expertise under resource constraints. Self-MoE improves the model's performance and also enhances its adaptability and interpretability, providing new directions for future research.

Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
