Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter
2024-10-07
Abstract: We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipping a shared base LLM with distinct domain-specific capabilities, activated via self-optimized routing. This allows for dynamic and capability-specific handling of various target tasks, enhancing overall capabilities, without extensive human-labeled data and added parameters. Our empirical results reveal that specializing LLMs may exhibit potential trade-offs in performances on non-specialized tasks. On the other hand, our Self-MoE demonstrates substantial improvements (6.5%p on average) over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design with semantic experts and routing. Our findings highlight the critical role of modularity, the applicability of Self-MoE to multiple base LLMs, and the potential of self-improvement in achieving efficient, scalable, and adaptable systems.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper addresses how to build a large language model (LLM) with multiple specialized capabilities under resource constraints and without significantly increasing the parameter count. LLMs are typically designed as monolithic architectures that need large amounts of annotated data to improve in a specific domain, which limits their adaptability and scalability. When adapted to specific tasks, monolithic models also tend to forget previously learned information, are inefficient, and lack transparency.

Specifically, the paper raises the following question: how can we construct a compositional LLM with expertise in multiple areas, under resource constraints and without significantly increasing parameters?

### Solution

To address this problem, the paper proposes **Self-MoE** (Self-Mixture of Experts), a framework that transforms a monolithic model into a compositional system called **MiXSE** (MiXture of Self-specialized Experts). The main features of Self-MoE are:

1. **Self-specialization**: Lightweight expert modules are created from a base LLM using self-generated synthetic data. Each expert module has specialized capabilities in a specific domain and is integrated with the shared base LLM through a self-optimized routing mechanism.
2. **Dynamic routing**: A router module dynamically selects the most appropriate expert module for the input task, so that different tasks are handled by the relevant expertise.
3. **Lightweight and adaptive**: The system achieves modularity and adaptability without relying on large amounts of human-annotated data and without significantly increasing parameters, which improves the model's overall capability and interpretability.
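The interaction between the shared base model, the expert modules, and the router can be illustrated with a small sketch. This is not the paper's implementation: the low-rank form of the experts, the linear router, the softmax top-k selection, and all names (`LowRankExpert`, `MixtureLayer`, `top_k`, `rank`) are assumptions made for illustration, and the weights are random rather than trained on self-generated data. The expert labels follow the domains named in the abstract.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LowRankExpert:
    """Hypothetical lightweight expert: a rank-r update added to the base output."""

    def __init__(self, dim_in, dim_out, rank, rng):
        self.down = random_matrix(rank, dim_in, rng)  # dim_in -> rank
        self.up = random_matrix(dim_out, rank, rng)   # rank -> dim_out

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


class MixtureLayer:
    """Shared base layer plus routed experts (illustrative assumption only)."""

    def __init__(self, dim_in, dim_out, expert_names, rank=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.base = random_matrix(dim_out, dim_in, rng)  # shared by all experts
        self.names = list(expert_names)
        self.experts = [LowRankExpert(dim_in, dim_out, rank, rng) for _ in self.names]
        self.router = random_matrix(len(self.names), dim_in, rng)  # one score row per expert
        self.top_k = top_k

    def route(self, x):
        """Return {expert_index: weight} for the top-k experts; weights sum to 1."""
        probs = softmax(matvec(self.router, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        mass = sum(probs[i] for i in chosen)
        return {i: probs[i] / mass for i in chosen}

    def forward(self, x):
        """Base output plus the weighted outputs of the selected experts."""
        out = matvec(self.base, x)
        weights = self.route(x)
        for i, w in weights.items():
            delta = self.experts[i](x)
            out = [o + w * d for o, d in zip(out, delta)]
        return out, [self.names[i] for i in weights]


layer = MixtureLayer(4, 3, ["knowledge", "reasoning", "math", "coding"], top_k=1)
output, active = layer.forward([0.5, -1.0, 0.25, 2.0])
print("active experts:", active)
print("output:", output)
```

In this sketch each expert adds only `rank * (dim_in + dim_out)` weights on top of the shared base layer, which is one way to keep the added parameter count small. The list of active expert names returned by `forward` shows how routing decisions can be inspected directly.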
### Experimental Results

Through extensive experiments on multiple benchmarks, the paper validates the effectiveness of Self-MoE:

- **Performance improvement**: Self-MoE improves on the base LLM by 6.5 percentage points on average across benchmarks in domains such as knowledge, reasoning, mathematics, and coding.
- **Outperforming other methods**: Self-MoE consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability.
- **Generalization ability**: Self-MoE also outperforms the base LLM in non-target domains, demonstrating good generalization ability.
- **Applicability**: Self-MoE can be applied to different families and sizes of LLMs, further validating its generality and effectiveness.

### Conclusion

Through the Self-MoE framework, the paper demonstrates how to construct a compositional large language model with multiple areas of expertise under resource constraints. Self-MoE improves the model's performance and also enhances its adaptability and interpretability, providing new directions for future research.