Abstract: The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses how to build smaller yet competitive language models by applying structured pruning to existing large language models (LLMs), while significantly reducing training costs. Specifically, it focuses on two key issues:

1. **How to determine the final pruned architecture**: Existing structured pruning techniques (such as CoFiPruning) usually cannot specify the target structure, so the pruned models end up suboptimal in both task performance and inference speed.
2. **How to continue pre-training the pruned model to achieve the desired performance**: Continuing pre-training on the original pre-training data leads to uneven rates of loss reduction across domains, which makes the use of data inefficient.

### Solutions

To address these issues, the paper proposes two key techniques:

1. **Targeted Structured Pruning**:
   - The source model is pruned to a specified target architecture by learning a set of pruning masks. The target architecture is taken from the configuration of an existing pre-trained model, to balance model expressiveness and inference efficiency.
   - Pruning is formulated as a constrained optimization problem: the masks are learned to find a sub-network that matches the preset target architecture while maximizing performance.
2. **Dynamic Batch Loading**:
   - The proportion of data from each domain in a training batch is adjusted dynamically, so that the loss in each domain reaches its reference value at approximately the same time.
   - This uses data more efficiently and accelerates overall performance improvement.
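
The dynamic batch loading idea described above can be illustrated with a small sketch. The summary does not state the exact update rule, so the exponentiated re-weighting below is an assumption, and the names `update_domain_weights` and `batch_counts` are hypothetical helpers introduced only for illustration. The sketch raises the sampling weight of any domain whose current loss is still above its reference loss and leaves domains at or below their reference untouched before renormalizing.

```python
import math


def update_domain_weights(weights, current_losses, reference_losses):
    """Re-weight domains by how far each loss is above its reference.

    Assumption: an exponentiated update on the non-negative loss gap,
    followed by renormalization. This is a sketch, not necessarily the
    exact rule used in the paper.
    """
    # Gap is clipped at zero: a domain already at or below its
    # reference loss gets no extra weight.
    gaps = [max(cur - ref, 0.0)
            for cur, ref in zip(current_losses, reference_losses)]
    scaled = [w * math.exp(g) for w, g in zip(weights, gaps)]
    total = sum(scaled)
    return [s / total for s in scaled]


def batch_counts(weights, batch_size):
    """Turn sampling weights into per-domain example counts for one batch.

    Uses largest-remainder rounding so the counts sum to batch_size.
    """
    raw = [w * batch_size for w in weights]
    counts = [int(r) for r in raw]
    leftovers = batch_size - sum(counts)
    # Give the remaining slots to the domains with the largest fractional part.
    order = sorted(range(len(raw)),
                   key=lambda i: raw[i] - counts[i],
                   reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts


# Two hypothetical domains: the first is lagging behind its reference loss.
new_weights = update_domain_weights(
    weights=[0.5, 0.5],
    current_losses=[2.0, 1.0],
    reference_losses=[1.0, 1.0],
)
counts = batch_counts(new_weights, batch_size=8)
```

In this toy setting the lagging domain receives a larger share of the next batch, which is the balancing behavior the summary attributes to dynamic batch loading. In practice the losses would be measured periodically on held-out validation data for each domain.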
### Experimental Results

The paper validates the proposed methods experimentally:

- **Downstream Task Performance**: The Sheared-LLaMA models outperform other open-source models of the same scale, such as Pythia, INCITE, and OpenLLaMA, on multiple downstream tasks.
- **Instruction Tuning**: After instruction tuning, the Sheared-LLaMA models also achieve a higher win rate, indicating that they can generate long, coherent, and information-rich responses.
- **Data Utilization**: Dynamic batch loading yields more balanced data utilization across domains and improves performance on downstream tasks.

### Conclusion

The paper provides strong evidence that applying structured pruning to existing large language models is a more cost-effective way to develop smaller, high-performance language models. The approach significantly reduces training costs while still achieving strong performance on multiple downstream tasks.

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

LLM-Pruner: On the Structural Pruning of Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

BlockPruner: Fine-grained Pruning for Large Language Models

LaCo: Large Language Model Pruning via Layer Collapse

A Simple and Effective Pruning Approach for Large Language Models

Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Pruning Foundation Models for High Accuracy without Retraining

LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

Reassessing Layer Pruning in LLMs: New Insights and Methods

Structured Pruning of Large Language Models

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models