Abstract: The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.

What problem does this paper attempt to address?

This paper addresses the excessive computational requirements of large language models (LLMs) in natural language processing. Although models such as LLaMA and GPT perform excellently on a range of complex language benchmarks, they require large amounts of computational resources, which limits their wider application. To address this, the authors propose SparseLLM, a new framework that introduces sparsity through global pruning to improve memory and computational efficiency.

### Main problems

1. **High consumption of computational resources**: Because of their large number of parameters, LLMs usually require significant computational resources to run, which makes them difficult to deploy and use in resource-limited environments.
2. **Limitations of traditional global pruning**: Traditional global pruning methods need to load the entire model onto a single GPU, which is impractical for modern LLMs because the models are too large.
3. **Suboptimal solutions from local pruning**: Local pruning compresses each layer separately and therefore needs fewer computational resources, but it only minimizes each layer's local error. The result is a decline in overall model performance, especially at high sparsity.

### Solutions

SparseLLM redefines the global pruning process and decomposes it into several manageable subproblems, achieving resource-efficient optimization while maintaining global optimality. Specifically:

- **Modular function chain**: Treat the LLM as a chain of modular functions, where the output of each module serves as the input of the next module.
- **Auxiliary variables**: Introduce auxiliary variables to decompose the problem so that each subproblem can be solved in a low-resource environment, while the subproblems remain coordinated toward the global pruning objective.
- **Alternating optimization algorithm**: Solve the subproblems with an alternating optimization algorithm that uses the closed-form solution of each subproblem, thereby achieving global optimality.

### Experimental results

The experimental results show that SparseLLM significantly outperforms existing local pruning methods, such as SparseGPT and Wanda, at high sparsity (> 60%). In particular, on large-scale models (such as OPT-66b), SparseLLM significantly reduces perplexity and improves the quality of the compressed model.

### Conclusion

SparseLLM provides an effective solution. Through global pruning, it significantly reduces the computational resource requirements of LLMs while maintaining high performance, making them easier to deploy and use in resource-constrained environments.
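
To make the distinction between local and global error in the summary above concrete, here is a minimal sketch in plain Python. It is not SparseLLM, SparseGPT, or Wanda: simple magnitude pruning stands in for the pruning rule, the "model" is a toy chain of three random linear layers with ReLU, and all names (`magnitude_prune`, `forward`, `local_errors`, `global_error`) are hypothetical. It only shows what each kind of error measures: a local error compares one pruned layer with its dense counterpart in isolation, while the global error compares the outputs of the whole pruned chain and the whole dense chain.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, t) for t in v]

def sq_err(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def magnitude_prune(W, sparsity):
    """Zero the smallest-magnitude entries of W (an illustrative stand-in rule)."""
    cols = len(W[0])
    order = sorted(range(len(W) * cols),
                   key=lambda i: abs(W[i // cols][i % cols]))
    drop = set(order[: int(len(order) * sparsity)])  # indices to remove
    return [[0.0 if r * cols + c in drop else W[r][c] for c in range(cols)]
            for r in range(len(W))]

def forward(layers, x):
    """Run x through the chain; return each layer's input and the final output."""
    inputs = []
    for W in layers:
        inputs.append(x)
        x = relu(matvec(W, x))  # output of one module feeds the next
    return inputs, x

rng = random.Random(0)
dim, depth, sparsity = 8, 3, 0.7  # toy sizes, chosen arbitrarily
dense = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
         for _ in range(depth)]
sparse = [magnitude_prune(W, sparsity) for W in dense]
x = [rng.gauss(0, 1) for _ in range(dim)]

dense_inputs, dense_out = forward(dense, x)
_, sparse_out = forward(sparse, x)

# Local view: each pruned layer is judged alone, on the dense layer's input.
local_errors = [sq_err(matvec(Ws, a), matvec(Wd, a))
                for Wd, Ws, a in zip(dense, sparse, dense_inputs)]
# Global view: the pruned chain is judged end to end.
global_error = sq_err(sparse_out, dense_out)

print("per-layer local errors:", local_errors)
print("end-to-end global error:", global_error)
```

Each local error is computed with the dense activations as input, so none of them accounts for the fact that, in the pruned chain, every layer after the first actually receives an already perturbed input. That gap between per-layer and end-to-end objectives is what the summary refers to when it says local pruning only minimizes local errors.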

SparseLLM: Towards Global Pruning for Pre-trained Language Models

LLM-Pruner: On the Structural Pruning of Large Language Models

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

SlimGPT: Layer-wise Structured Pruning for Large Language Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models

LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

A Simple and Effective Pruning Approach for Large Language Models

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

Structured Optimal Brain Pruning for Large Language Models

Pruning as a Domain-specific LLM Extractor

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity