Unlocking Emergent Modularity in Large Language Models

Zihan Qiu, Zeyu Huang, Jie Fu
2024-04-01
Abstract: Modular Neural Networks (MNNs) demonstrate various advantages over monolithic models. Existing MNNs are generally $\textit{explicit}$: their modular architectures are pre-defined, with individual modules expected to implement distinct functions. Recent works reveal that standard pre-trained transformers contain $\textit{implicit}$ modularity, namely $\textit{Emergent Modularity}$. They indicate that such modular structures emerge spontaneously during the early pre-training phase. Despite the benefits of modularity, most Language Models (LMs) are still treated as monolithic models in the pre-train and fine-tune paradigm, with their emergent modularity locked and underutilized. In this work, focusing on unlocking the emergent modularity in LMs, we show that standard LMs can be fine-tuned as their Mixture-of-Experts (MoE) counterparts without introducing any extra parameters. Such MoEs are derived from emergent modularity and are referred to as Emergent MoEs (EMoE). Our experiments demonstrate that fine-tuning EMoE effectively improves downstream in-domain and out-of-domain generalization compared with vanilla fine-tuning. Our analysis and ablation studies further illustrate that it is robust to various configurations and can scale up to Large Language Models (i.e., Llama2-7B and Llama-30B). Code is available at https://github.com/qiuzh20/EMoE.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper "Unlocking Emergent Modularity in Large Language Models" asks how the implicit modularity (Emergent Modularity, EM) of pre-trained language models (LMs) can be used to improve generalization on downstream tasks. Although modular neural networks (MNNs) offer advantages in adaptability, data efficiency, and generalization, most language models are still treated as monolithic models during pre-training and fine-tuning, so their latent modularity stays locked and underutilized. The study therefore proposes a method that converts a standard language model into a Mixture-of-Experts (MoE) version, called Emergent MoE (EMoE), without introducing additional parameters. The goal is to unlock the implicit modularity of pre-trained language models and thereby improve performance on both in-domain and out-of-domain tasks.

### Main Contributions

1. **Unlocking Implicit Modularity**: By splitting certain feed-forward network (FFN) layers of a pre-trained language model into MoE layers, the researchers show how to unlock the model's implicit modularity without adding extra parameters.
2. **Improving Downstream Task Performance**: Experimental results show that fine-tuning EMoE models improves generalization on downstream tasks, especially out-of-domain tasks.
3. **Robustness and Scalability**: Analysis and ablation studies indicate that the EMoE method is robust to various configurations and scales to large language models (e.g., Llama2-7B and Llama-30B).
4. **Parameter Update Mechanism**: The study finds that EMoE improves performance mainly through better parameter updates during fine-tuning rather than by directly changing the inference process.
5. **Masking Negative Transfer Effects**: EMoE can mask neurons that cause negative transfer, further improving the model's performance.
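One way to see why the split in contribution 1 needs no extra parameters is to write the FFN as a sum over its hidden neurons. The sketch below assumes a two-layer FFN with an elementwise activation $\sigma$ and omits biases; the symbols $k_i$, $v_i$ (the $i$-th rows of the first- and second-layer weights), $d_{ff}$, and the index sets $S_e$ are notation introduced here for illustration, not taken from the paper.

```latex
\mathrm{FFN}(x)
  = \sum_{i=1}^{d_{ff}} \sigma(x \cdot k_i)\, v_i
  = \sum_{e=1}^{N} \underbrace{\sum_{i \in S_e} \sigma(x \cdot k_i)\, v_i}_{\text{expert } e},
\qquad
S_1 \cup \dots \cup S_N = \{1, \dots, d_{ff}\},\quad
S_e \cap S_{e'} = \emptyset \ \ (e \neq e').
```

Under these assumptions each expert only reuses existing rows of the original weights, so activating all $N$ experts reproduces the dense FFN exactly, and activating a subset simply drops the terms of the unselected experts.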
### Method Overview

1. **Preliminary Concepts**:
   - **Transformer FFNs**: An FFN layer can be viewed as a key-value memory, where the input serves as the query, the first layer's weights as the keys, and the second layer's weights as the values.
   - **Mixture-of-Experts (MoE)**: An MoE layer replaces the original FFN layer with several experts and adds a gating module that selects which experts process each input.
2. **Emergent Mixture-of-Experts**:
   - **Cluster-based Expert Construction**: The key vectors of an FFN layer are clustered so that neurons with similar activation patterns are grouped into the same expert.
   - **Avg-k Gating**: The gating module is built by averaging the key vectors of each expert; the experts with the highest activation scores are selected to take part in the computation.
3. **Experimental Setup**:
   - **Models and Benchmarks**: BERT and GPT2 series models are evaluated on benchmarks including GLUE and GLUE-X.
   - **Baseline Methods**: Vanilla LoRA-tuning, GMoE, and EMoE-learn.

### Experimental Results

1. **BERT and GPT2**:
   - **ID and OOD Performance**: EMoE outperforms vanilla LoRA-tuning and GMoE on multiple tasks, especially out-of-domain tasks.
   - **Stability**: EMoE performs more stably across tasks, with overall results better than EMoE-learn.
2. **Llama**:
   - **Scalability**: EMoE scales to larger models (e.g., Llama2-7B and Llama-30B) and still improves performance without additional computational cost.

### Analysis and Discussion

1. **Does EMoE Unlock Implicit Modularity**:
   - Visualizing neuron activation patterns shows that clustering based on key vectors can effectively decompose a standard model into modular components.
   - The way experts are used across different tasks also indicates that EMoE does unlock the model's implicit modularity.
2. **How EMoE Enhances Fine-tuning Performance**:
   - **Parameter Updates**: EMoE improves performance mainly through better parameter updates during the fine-tuning stage.
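The cluster-based expert construction and avg-k gating described in the Method Overview can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, plain k-means stands in for whatever clustering the paper uses, the activation is assumed to be ReLU, biases are omitted, and the outputs of the selected experts are assumed to be summed without gate weighting.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def kmeans_labels(points, k, iters=25, seed=0):
    """Plain k-means (an assumption); returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [sum((pj - cj) ** 2 for pj, cj in zip(p, c)) for c in centers]
            labels[i] = dists.index(min(dists))
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = mean_vector(members)
    return labels


def build_experts(keys, num_experts, seed=0):
    """Group FFN neurons by key-vector cluster and build the avg-k gate."""
    labels = kmeans_labels(keys, num_experts, seed=seed)
    # One list of neuron indices per non-empty cluster.
    experts = [[i for i, lab in enumerate(labels) if lab == e]
               for e in sorted(set(labels))]
    # Avg-k gate: one row per expert, the mean of that expert's key vectors.
    gate = [mean_vector([keys[i] for i in idx]) for idx in experts]
    return experts, gate


def ffn(x, keys, values, neurons):
    """ReLU FFN (assumed activation, no biases) over the given neurons only."""
    out = [0.0] * len(values[0])
    for i in neurons:
        h = max(dot(keys[i], x), 0.0)
        for j, v in enumerate(values[i]):
            out[j] += h * v
    return out


def emoe_ffn(x, keys, values, experts, gate, top_k):
    """Score experts with the averaged keys and run only the top_k of them."""
    scores = [dot(g, x) for g in gate]
    ranked = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)
    neurons = [i for e in ranked[:top_k] for i in experts[e]]
    return ffn(x, keys, values, neurons)


# Toy data: 16 hidden neurons, model width 4 (sizes chosen for illustration).
rng = random.Random(0)
d_model, d_ff = 4, 16
keys = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
values = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
x = [rng.gauss(0, 1) for _ in range(d_model)]

experts, gate = build_experts(keys, num_experts=4)
sparse_out = emoe_ffn(x, keys, values, experts, gate, top_k=2)
full_out = emoe_ffn(x, keys, values, experts, gate, top_k=len(experts))
dense_out = ffn(x, keys, values, range(d_ff))
```

In this sketch the experts and the gate are derived entirely from the existing key vectors, so nothing new is learned or stored. Selecting every expert (`full_out`) matches the dense FFN (`dense_out`) up to floating-point rounding, while `sparse_out` uses only the neurons of the two highest-scoring experts.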