Abstract: Modular Neural Networks (MNNs) demonstrate various advantages over monolithic models. Existing MNNs are generally $\textit{explicit}$: their modular architectures are pre-defined, with individual modules expected to implement distinct functions. Recent works reveal that standard pre-trained transformers contain $\textit{implicit}$ modularity, namely $\textit{Emergent Modularity}$. They indicate that such modular structures emerge spontaneously during the early pre-training phase. Despite the benefits of modularity, most Language Models (LMs) are still treated as monolithic models in the pre-train and fine-tune paradigm, with their emergent modularity locked and underutilized. In this work, focusing on unlocking the emergent modularity in LMs, we show that standard LMs can be fine-tuned as their Mixture-of-Experts (MoEs) counterparts without introducing any extra parameters. Such MoEs are derived from emergent modularity and are referred to as Emergent MoEs (EMoE). Our experiments demonstrate that fine-tuning EMoE effectively improves downstream in-domain and out-of-domain generalization compared with vanilla fine-tuning. Our analysis and ablation studies further illustrate that EMoE is robust to various configurations and can scale up to Large Language Models (i.e., Llama2-7B and Llama-30B). Code is available at https://github.com/qiuzh20/EMoE.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "Unlocking Emergent Modularity in Large Language Models" addresses how to leverage the implicit modularity (Emergent Modularity, EM) in pre-trained language models (LMs) to improve generalization on downstream tasks. Although modular neural networks (MNNs) offer advantages in adaptability, data efficiency, and generalization, most language models are still treated as monolithic models during pre-training and fine-tuning, with their potential modularity locked and underutilized. The study therefore proposes a method that converts standard language models into their Mixture-of-Experts (MoE) versions, called Emergent MoEs (EMoE), without introducing additional parameters. In this way, the researchers aim to unlock the implicit modularity in pre-trained language models and thereby improve performance on both in-domain and out-of-domain tasks.

### Main Contributions

1. **Unlocking Implicit Modularity**: By splitting certain feed-forward network (FFN) layers in pre-trained language models into MoE layers, the researchers demonstrate how to unlock the model's implicit modularity without adding extra parameters.
2. **Improving Downstream Task Performance**: Experimental results show that fine-tuning EMoE models improves generalization on downstream tasks compared with vanilla fine-tuning, especially on out-of-domain tasks.
3. **Robustness and Scalability**: Analysis and ablation studies indicate that the EMoE method is robust to various configurations and can be scaled to large language models (e.g., Llama2-7B and Llama-30B).
4. **Parameter Update Mechanism**: The study finds that EMoE mainly enhances performance by improving parameter updates during the fine-tuning stage rather than by directly affecting the inference process.
5. **Masking Negative Transfer Effects**: EMoE can mask neurons with negative transfer effects, further enhancing the model's performance.
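
The "no extra parameters" point in contribution 1 can be illustrated with a short derivation. The notation here ($W_1$, $W_2$, $\sigma$, $k_i$, $v_i$, the index sets $S_e$, and the selected set $\mathcal{E}(x)$) is introduced for illustration and is not taken from the paper; the sketch assumes a plain two-layer FFN with an element-wise activation and omits biases. Let $k_i$ be the $i$-th row of $W_1$ and $v_i$ the $i$-th column of $W_2$, and let $S_1, \dots, S_N$ be a partition of the hidden neurons $\{1, \dots, d_{\text{ff}}\}$ into $N$ experts.

$$
\begin{aligned}
\mathrm{FFN}(x) &= W_2\,\sigma(W_1 x) = \sum_{i=1}^{d_{\text{ff}}} \sigma(k_i^\top x)\, v_i \\
&= \sum_{e=1}^{N} \underbrace{\sum_{i \in S_e} \sigma(k_i^\top x)\, v_i}_{\text{expert } e}, \\
\mathrm{MoE}(x) &= \sum_{e \in \mathcal{E}(x)} \sum_{i \in S_e} \sigma(k_i^\top x)\, v_i, \qquad \mathcal{E}(x) \subseteq \{1, \dots, N\}.
\end{aligned}
$$

Under these assumptions, each expert reuses a slice of the existing weights, and selecting every expert recovers the dense FFN exactly; selecting a subset drops the contributions of the unselected neurons.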
### Method Overview

1. **Preliminary Concepts**:
   - **Transformer FFNs**: FFN layers can be viewed as key-value memories, where the input serves as the query, the first layer's weights as the keys, and the second layer's weights as the values.
   - **Mixture-of-Experts (MoEs)**: An MoE structure is obtained by replacing the original FFN layers with expert modules and introducing gating modules.
2. **Emergent Mixture-of-Experts**:
   - **Cluster-based Expert Construction**: The key vectors in FFN layers are clustered so that neurons with similar activation patterns are grouped into the same expert.
   - **Avg-k Gating**: The gating module is constructed by averaging the key vectors of each expert; the experts with the highest activation scores are selected to participate in the computation.
3. **Experimental Setup**:
   - **Models and Benchmarks**: BERT and GPT2 series models are used for evaluation, with benchmarks including GLUE and GLUE-X.
   - **Baseline Methods**: Vanilla LoRA-tuning, GMoE, and EMoE-learn.

### Experimental Results

1. **BERT and GPT2**:
   - **ID and OOD Performance**: EMoE outperforms vanilla LoRA-tuning and GMoE on multiple tasks, especially on out-of-domain tasks.
   - **Stability**: EMoE shows more stable performance across different tasks, with overall results better than EMoE-learn.
2. **Llama**:
   - **Scalability**: EMoE can be scaled to larger models (e.g., Llama2-7B and Llama-30B) and still shows significant performance improvements without additional computational costs.

### Analysis and Discussion

1. **Does EMoE Unlock Implicit Modularity**:
   - By visualizing the activation patterns of neurons, the study finds that clustering based on key vectors can effectively decompose modular components in standard models.
   - The usage of experts across different tasks also indicates that EMoE does unlock the implicit modularity in the model.
2. **How EMoE Enhances Fine-tuning Performance**:
   - **Parameter Updates**: EMoE mainly enhances performance through improved parameter updates during the fine-tuning stage.
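
The cluster-based expert construction and avg-k gating steps from the Method Overview can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: plain k-means stands in for whatever clustering the paper uses, top-k selection over the gate scores is assumed, biases are omitted, and all function names (`cluster_keys`, `avg_k_gate`, `emoe_ffn`) are hypothetical.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU (assumed activation for this sketch)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def cluster_keys(keys, num_experts, iters=20, seed=0):
    """Group FFN neurons by plain k-means over their key vectors (rows of W1).

    Returns an array with one expert id per neuron. Plain k-means is a
    stand-in; the resulting experts are not guaranteed to be equal in size.
    """
    rng = np.random.default_rng(seed)
    centers = keys[rng.choice(len(keys), num_experts, replace=False)]
    for _ in range(iters):
        dists = ((keys[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for e in range(num_experts):
            members = keys[assign == e]
            if len(members) > 0:          # keep the old center if a cluster empties
                centers[e] = members.mean(axis=0)
    return assign

def avg_k_gate(keys, assign, num_experts):
    """Build gate weights by averaging the key vectors inside each expert."""
    gate = np.zeros((num_experts, keys.shape[1]))
    for e in range(num_experts):
        members = keys[assign == e]
        if len(members) > 0:
            gate[e] = members.mean(axis=0)
    return gate

def emoe_ffn(x, W1, W2, assign, gate, top_k):
    """Run the FFN using only the neurons of the top_k highest-scoring experts."""
    scores = gate @ x                      # one score per expert
    chosen = np.argsort(scores)[-top_k:]   # assumed top-k selection
    mask = np.isin(assign, chosen)         # neurons belonging to chosen experts
    hidden = gelu(W1[mask] @ x)
    return W2[:, mask] @ hidden

# Toy example: d_model = 8, d_ff = 32, 4 experts, 2 active per input.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 8))   # rows are key vectors
W2 = rng.normal(size=(8, 32))   # columns are value vectors
x = rng.normal(size=8)
assign = cluster_keys(W1, num_experts=4)
gate = avg_k_gate(W1, assign, num_experts=4)
sparse_out = emoe_ffn(x, W1, W2, assign, gate, top_k=2)
dense_out = emoe_ffn(x, W1, W2, assign, gate, top_k=4)  # all experts active
```

In this sketch the gate is computed from the existing keys and the experts are slices of the existing weight matrices, so no new parameters are introduced; activating all experts reproduces the dense FFN output `W2 @ gelu(W1 @ x)`.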
