Abstract: The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster inference while maintaining performance. However, SMoE models still face limitations in broader deployment due to their large parameter counts and significant GPU memory requirements. In this work, we introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models. EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks. EEP can be used to reduce both the total number of experts (thus saving GPU memory) and the number of active experts (thus accelerating inference). For example, we demonstrate that pruning up to 75% of experts in Mixtral $8\times7$B-Instruct results in a substantial reduction in parameters with minimal performance loss. Remarkably, we observe improved performance on certain tasks, such as a significant increase in accuracy on the SQuAD dataset (from 53.4% to 75.4%), when pruning half of the experts. With these results, EEP not only lowers the barrier to deploying SMoE models, but also challenges the conventional understanding of model pruning by showing that fewer experts can lead to better task-specific performance without any fine-tuning. Code is available at https://github.com/imagination-research/EEP.
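
To make the idea of a gradient-free evolutionary search over experts concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify EEP's search operators, so the swap mutation, the keep-the-best-half selection, and all names here (`evolve_expert_mask`, `toy_fitness`, `scores`) are illustrative assumptions. The fitness function stands in for a forward-only evaluation of the pruned model on a downstream task.

```python
import functools
import random


def evolve_expert_mask(num_experts, keep, fitness, generations=50,
                       population=16, seed=0):
    """Search for `keep` experts to retain out of `num_experts`.

    Only calls `fitness` (a stand-in for inference-only evaluation);
    no gradients are computed. Hypothetical sketch, not EEP itself.
    """
    rng = random.Random(seed)
    # Cache scores so a candidate is evaluated at most once.
    score = functools.lru_cache(maxsize=None)(fitness)

    def random_candidate():
        # A candidate is a sorted tuple of retained expert indices.
        return tuple(sorted(rng.sample(range(num_experts), keep)))

    def mutate(candidate):
        # Swap one retained expert for one currently pruned expert.
        pruned = [e for e in range(num_experts) if e not in candidate]
        if not pruned:
            return candidate
        kept = list(candidate)
        kept[rng.randrange(len(kept))] = rng.choice(pruned)
        return tuple(sorted(kept))

    pop = [random_candidate() for _ in range(population)]
    for _ in range(generations):
        # Keep the better half unchanged, refill with mutated survivors.
        pop.sort(key=score, reverse=True)
        survivors = pop[: population // 2]
        children = [mutate(rng.choice(survivors))
                    for _ in range(population - len(survivors))]
        pop = survivors + children
    return max(pop, key=score)


# Toy stand-in for task accuracy: made-up per-expert usefulness scores.
scores = [0.1, 0.7, 0.2, 0.9, 0.3, 0.4, 0.05, 0.6]


def toy_fitness(candidate):
    return sum(scores[e] for e in candidate)


# Retain 2 of 8 experts, i.e. prune 75% of the experts in one layer.
best = evolve_expert_mask(num_experts=8, keep=2, fitness=toy_fitness)
print(best, toy_fitness(best))
```

In a real setting the fitness call would run the pruned SMoE model on a small task-specific evaluation set, which is why the sketch caches scores: each evaluation costs a full inference pass.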

ELO-Mask: Effective and Layerwise Optimization of Mask for Sparse LLMs

MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models

Sparsity-Accelerated Training for Large Language Models

Enhancing Parameter Efficiency and Generalization in Large-Scale Models: A Regularized and Masked Low-Rank Adaptation Approach

Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models

MLAE: Masked LoRA Experts for Visual Parameter-Efficient Fine-Tuning

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Learn To be Efficient: Build Structured Sparsity in Large Language Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Sparse Matrix in Large Language Model Fine-tuning

DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

Scaling Sparse Fine-Tuning to Large Language Models

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

MLAE: Masked LoRA Experts for Parameter-Efficient Fine-Tuning

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs