Abstract: Recent research has demonstrated that Feed-Forward Networks (FFNs) in Large Language Models (LLMs) play a pivotal role in storing diverse linguistic and factual knowledge. Conventional methods frequently face challenges due to knowledge confusion stemming from their monolithic and redundant architectures, which calls for more efficient solutions with minimal computational overhead, particularly for LLMs. In this paper, we explore the FFN computation paradigm in LLMs and introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications, while maintaining the same level of performance. Furthermore, we embed a router from the Mixture-of-Experts (MoE), combined with our devised Prior-Approximate (PA) loss term that facilitates the dynamic activation of experts and knowledge adaptation, thereby accelerating computational processes and enhancing performance using minimal training data and fine-tuning steps. FactorLLM thus enables efficient knowledge factorization and activates select groups of experts specifically tailored to designated tasks, emulating the interactive functional segmentation of the human brain. Extensive experiments across various benchmarks demonstrate the effectiveness of FactorLLM, which achieves performance comparable to the source model, retaining up to 85% of its performance while obtaining an over 30% increase in inference speed. Code: https://github.com/zhenwuweihe/FactorLLM

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the efficiency of large language models (LLMs) in storing and processing knowledge. Existing LLMs typically use monolithic feed-forward networks (FFNs), so for any specific task many parameters are redundant and computation is wasted. This increases training and inference time and may also contribute to "hallucination", where the model draws on knowledge irrelevant to the task at hand.

To address these challenges, the paper proposes **FactorLLM**, a method that decomposes trained dense FFNs into sparse sub-networks while maintaining the same level of performance. FactorLLM also adds a router from the Mixture-of-Experts (MoE) architecture and combines it with a Prior-Approximate (PA) loss term to dynamically activate experts and adapt knowledge, which accelerates computation and improves performance. The method thus factorizes knowledge efficiently and activates specialized groups of experts for specific tasks, emulating the functional segmentation of the human brain.

### Main contributions

1. **A simple and effective method, FactorLLM**, which decomposes the dense FFN in an LLM into a mixture of experts to improve inference efficiency while retaining the original model's performance on specific tasks.
2. **The Prior-Approximate Router (PAR)**, which uses the prior knowledge already present in the LLM to jointly fine-tune only the injected router and the decomposed experts, enabling parameter- and data-efficient adaptation of the LLM to specific knowledge domains.
3. **Extensive evaluations** that verify the effectiveness and robustness of FactorLLM across multiple model architectures.
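
The claim that a dense FFN can be split into experts without changing its output can be checked numerically. The sketch below is illustrative only and is not the paper's code: it assumes a plain two-layer ReLU FFN (LLaMA-style models use a gated FFN instead), equal contiguous neuron groups, and hypothetical names such as `split_into_experts`.

```python
import numpy as np

def dense_ffn(x, w_in, w_out):
    """Two-layer ReLU FFN (an assumed simplification of the real architecture)."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def split_into_experts(w_in, w_out, num_experts):
    """Partition the hidden neurons into equal contiguous groups, one per expert.

    Each expert keeps the matching columns of w_in and rows of w_out,
    so no weight value is modified or dropped.
    """
    d_ff = w_in.shape[1]
    assert d_ff % num_experts == 0, "hidden size must divide evenly"
    in_parts = np.split(w_in, num_experts, axis=1)
    out_parts = np.split(w_out, num_experts, axis=0)
    return list(zip(in_parts, out_parts))

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 8, 32, 4
x = rng.standard_normal((5, d_model))          # 5 token vectors
w_in = rng.standard_normal((d_model, d_ff))
w_out = rng.standard_normal((d_ff, d_model))

experts = split_into_experts(w_in, w_out, num_experts)
dense_out = dense_ffn(x, w_in, w_out)
# Summing the outputs of all experts reproduces the dense FFN output.
summed_out = sum(dense_ffn(x, wi, wo) for wi, wo in experts)
reconstruction_ok = np.allclose(dense_out, summed_out)
```

With every expert active the result is identical to the dense layer because the ReLU acts on each hidden neuron independently. Any speedup comes from activating only a subset of experts per token, which is an approximation that the router has to learn.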
According to the paper, FactorLLM reduces computational overhead by more than 30% while retaining about 85% of the original model's performance.

### Key technologies of the solution

1. **Model decomposition**: The pre-trained FFN is decomposed into multiple sub-networks, each responsible for a specific type of knowledge. Because the decomposition only reorganizes the weight matrices, without modifying any values or omitting information, it does not affect performance.
2. **Mixture-of-experts architecture**: The decomposed sub-networks are treated as experts, and the resulting sparse structure accelerates inference. Randomly initialized routers dynamically activate specific experts so that specialized knowledge is used more efficiently.
3. **Prior-Approximate Router (PAR)**: Within a teacher-student framework, the prior knowledge of the teacher model is used to generate pseudo-assignments that guide the student model to learn the expert activation strategy quickly. PAR trains the router to select the experts that best match the teacher model's knowledge by minimizing the difference in expert selection.
4. **Optimization objective**: To support task-specific knowledge adaptation, the overall objective combines the fine-tuning loss with the custom PA loss term, balancing the model's generalization ability against the specialization of its experts.

### Experimental results

1. **Performance improvement**: FactorLLM performs well on multiple benchmarks. On BoolQ in particular, the directly fine-tuned FactorLLM-3K exceeds the known upper bound by 3.9% on TinyLlama and 1.7% on MobileLlama.
2. **Computational efficiency**: FactorLLM significantly reduces GFLOPs at inference time; the 1R4E1K configuration in particular reduces the amount of computation by about 75%.
Even in its most efficient configuration, FactorLLM outperforms MoEfication on multiple datasets.

3. **Data efficiency**: FactorLLM maintains more than 85% of the original model's performance while using only 0.03-0.04% of the training data. This is particularly advantageous when large amounts of labeled data are hard to obtain.

### Conclusion

FactorLLM addresses the efficiency problems of LLMs in knowledge storage and processing by decomposing the FFN and introducing a mixture-of-experts architecture. The method improves inference speed while retaining most of the original model's accuracy, making it a practical option for LLM applications in resource-constrained environments.
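
The Prior-Approximate Router and the combined objective described under "Key technologies" can be sketched as follows. This is one possible reading of the summary, not the paper's exact formulation: scoring neuron groups by total absolute activation, using cross-entropy as the "difference in expert selection", the weight `pa_weight`, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_pseudo_assignments(hidden, num_experts):
    """Label each token with the neuron group that is most active in the teacher.

    hidden: teacher FFN hidden activations, shape (tokens, d_ff).
    Groups are equal contiguous slices of the hidden dimension (assumed).
    """
    tokens, d_ff = hidden.shape
    groups = np.abs(hidden).reshape(tokens, num_experts, d_ff // num_experts)
    return groups.sum(axis=-1).argmax(axis=-1)

def pa_loss(router_logits, pseudo_labels):
    """Cross-entropy between the router distribution and the pseudo-assignments."""
    probs = softmax(router_logits)
    picked = probs[np.arange(len(pseudo_labels)), pseudo_labels]
    return float(-np.log(picked + 1e-12).mean())

def total_loss(task_loss, router_logits, pseudo_labels, pa_weight=0.1):
    """Fine-tuning loss plus a weighted PA term (pa_weight is a hypothetical knob)."""
    return task_loss + pa_weight * pa_loss(router_logits, pseudo_labels)

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 8, 32, 4
x = rng.standard_normal((5, d_model))
w_in = rng.standard_normal((d_model, d_ff))
hidden = np.maximum(x @ w_in, 0.0)           # teacher activations (ReLU assumed)
labels = teacher_pseudo_assignments(hidden, num_experts)

w_router = rng.standard_normal((d_model, num_experts))  # randomly initialized router
random_loss = pa_loss(x @ w_router, labels)
# A router that already agrees with the teacher incurs almost no PA loss.
aligned_loss = pa_loss(20.0 * np.eye(num_experts)[labels], labels)
combined = total_loss(task_loss=1.0, router_logits=x @ w_router, pseudo_labels=labels)
```

The design intent in this sketch is that the PA term pulls a freshly initialized router toward the expert choices implied by the teacher's own activations, while the task loss keeps the experts useful for the downstream objective.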

FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models
