Abstract: Large language models (LLMs) based on the transformer architecture are witnessing a notable trend of size expansion, which brings considerable costs to both model training and inference. However, existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues, including hardware support limitations, the need for extensive training, and alterations to the model's internal structure. In this paper, we propose a concise layer-wise structured pruner called \textit{Layer Collapse (LaCo)}, in which rear model layers collapse into a prior layer, enabling a rapid reduction in model size while preserving the model structure. Comprehensive experiments show that our method maintains an average task performance of over 80\% at pruning ratios of 25-30\%, significantly outperforming existing state-of-the-art structured pruning methods. We also conduct post-training experiments to confirm that \textit{LaCo} effectively inherits the parameters of the original model. Additionally, we perform ablation studies on various settings of \textit{LaCo}. Finally, we discuss our motivation from the perspective of layer-wise similarity and evaluate the performance of the pruned LLMs across various pruning ratios\footnote{\url{https://github.com/yangyifei729/LaCo}}.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses the high training and inference costs that come with scaling large language models (LLMs). Existing methods such as model quantization, knowledge distillation, and model pruning can reduce these costs, but each has significant drawbacks:

1. **Model Quantization**: Typically requires specific hardware support and may degrade model performance.
2. **Knowledge Distillation**: Often requires training a smaller model, which is both expensive and task-specific.
3. **Model Pruning**: Unstructured pruning produces sparse models, which can hurt performance and depends on hardware support; structured pruning removes specific modules, which often alters the model structure and reduces portability.

To overcome these issues, the paper proposes a new layer-wise structured pruning method, **Layer Collapse (LaCo)**. Its core idea is to prune layers directly from a well-trained LLM by replacing the parameters of multiple layers with those of a single layer.

### Main Contributions

1. **Efficient Pruning**: LaCo can directly remove 30%-50% of the model's layers without additional training while maintaining model performance. Experimental results show that LaCo outperforms existing structured pruning methods across multiple benchmarks.
2. **Preservation of Internal Structure**: LaCo retains the internal structure of the LLM, such as intermediate dimensions, so the pruned model can be integrated into existing applications without changing the system implementation.
3. **Parameter Inheritance**: Post-training experiments confirm that LaCo effectively inherits the parameters of the original model and needs only minimal training to return to the original model's loss convergence level.
4. **Performance Evaluation**: The paper evaluates model performance under different pruning ratios, discusses the motivation for the method from the perspective of layer-wise similarity, and reports ablation studies under various settings.

### Method Overview

The main steps of LaCo are:

1. **Preparation Phase**:
   - Define the number of layers \( C \) to be merged in each operation.
   - Configure the layer range \([L, H]\) in which merge operations may take place.
   - Set the minimum interval \( I \) between two merge operations.
   - Choose a small set of calibration samples \( D \) for forward computation and a threshold \( T \): the similarity between the pruned model's output representations and those of the original model must not fall below \( T \).
2. **Pruning Process**:
   - Initialize the pruned model \( M^* \) from the original model \( M \) and set the layer pointer \( l \) to \( H - C \).
   - Iterate:
     - **Layer Merging**: Merge the \( K \) layers that follow layer \( l \) into layer \( l \), where \( K \) is limited by the merge size \( C \), then discard those \( K \) now-redundant layers to obtain a candidate model \( M_{\text{tmp}} \).
     - **Similarity Calculation**: Run the calibration samples \( D \) through both the candidate model \( M_{\text{tmp}} \) and the original model \( M \), and compute the cosine similarity \( s \) of their output representations.
     - **Merge Evaluation and Adjustment**: If \( s \) exceeds the threshold \( T \), the merge is accepted: update \( M^* \) to \( M_{\text{tmp}} \) and move the pointer \( l \) down by \( I \) layers. Otherwise, move \( l \) down by only one layer.

### Experimental Results

The paper conducts experiments on several popular LLMs: Llama2-7B and Llama2-13B, as well as Baichuan2-7B and Baichuan2-13B, which support both Chinese and English. Across multiple benchmarks, LaCo maintains over 80% of the average task performance at pruning ratios of 25%-30%, significantly outperforming existing structured pruning methods.

### Conclusion

LaCo is an efficient layer-wise structured pruning method that significantly reduces model size while largely maintaining performance and leaving the model's internal structure unchanged. With a small amount of post-training, models pruned by LaCo quickly recover toward the original model's loss level.
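
The pruning loop described under Method Overview can be sketched in a few lines of Python. This is a toy illustration only, not the authors' implementation: each "layer" is a plain list of floats, `forward` is a made-up residual update standing in for a transformer block, and all function and parameter names are hypothetical. Two details are assumptions of this sketch because the summary does not spell them out: the merge rule (adding the parameter differences of the following layers onto layer \( l \)) and the stopping condition (the pointer leaving the range at \( L \)). Indices are 0-based here.

```python
import math
from typing import List, Sequence

Layer = List[float]   # toy stand-in for one layer's parameters
Model = List[Layer]


def forward(model: Model, x: Sequence[float]) -> List[float]:
    """Toy forward pass: h <- h + tanh(w * h), element-wise, per layer."""
    h = list(x)
    for w in model:
        h = [hi + math.tanh(wi * hi) for hi, wi in zip(h, w)]
    return h


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0


def avg_similarity(m1: Model, m2: Model, samples: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity of the two models' outputs over the samples."""
    return sum(cosine(forward(m1, x), forward(m2, x)) for x in samples) / len(samples)


def merge_layers(model: Model, l: int, k: int) -> Model:
    """Fold the k layers after index l into layer l, then drop them.

    Assumed merge rule: theta_l + sum_j (theta_{l+j} - theta_l).
    """
    base = model[l]
    merged = list(base)
    for follower in model[l + 1 : l + 1 + k]:
        merged = [m + (f - b) for m, f, b in zip(merged, follower, base)]
    return model[:l] + [merged] + model[l + 1 + k :]


def laco_prune(model: Model, samples, merge_size: int, low: int, high: int,
               interval: int, threshold: float) -> Model:
    """merge_size=C, [low, high]=[L, H], interval=I, threshold=T."""
    assert interval >= 1, "pointer must move on every accepted merge"
    pruned = [list(layer) for layer in model]       # M* starts as a copy of M
    ptr = high - merge_size                         # l <- H - C
    while ptr >= low:                               # assumed stopping condition
        k = min(merge_size - 1, len(pruned) - 1 - ptr)
        if k > 0:
            candidate = merge_layers(pruned, ptr, k)            # M_tmp
            if avg_similarity(candidate, model, samples) > threshold:
                pruned = candidate                              # accept the merge
                ptr = min(ptr - interval, len(pruned) - merge_size)
                continue
        ptr -= 1                                    # rejected: step down one layer
    return pruned


if __name__ == "__main__":
    toy = [[0.1 + 0.01 * i, 0.2 - 0.01 * i, 0.05] for i in range(8)]
    calib = [[1.0, -0.5, 2.0], [0.3, 0.8, -1.2]]
    small = laco_prune(toy, calib, merge_size=3, low=1, high=8, interval=2, threshold=0.9)
    print(len(toy), "->", len(small), "layers")
```

In a real setting the layers would be transformer blocks, the merge would be applied to each parameter tensor of a block, and the similarity would be computed on the hidden states that the model produces for the calibration sentences. Raising `threshold` makes the loop more conservative, and a value above 1 rejects every merge.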

LaCo: Large Language Model Pruning via Layer Collapse

LLM-Pruner: On the Structural Pruning of Large Language Models

BlockPruner: Fine-grained Pruning for Large Language Models

Reassessing Layer Pruning in LLMs: New Insights and Methods

Streamlining Redundant Layers to Compress Large Language Models

Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models

Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

Large Language Model Pruning

Pruning as a Domain-specific LLM Extractor

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Pruning Foundation Models for High Accuracy without Retraining

Less is More: Towards Green Code Large Language Models via Unified Structural Pruning

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models