Abstract: Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters, making large compute a necessity while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning which, in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model's output, contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% of the performance of Llama3-8B with 25% of layers removed, and 95% of the performance of Llama3-70B with 30% of layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance, without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, allowing us to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often at deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.

What problem does this paper attempt to address?

This paper addresses several key problems in large language models (LLMs):

1. **High compute cost**: Current LLMs usually contain billions of parameters, so deploying and using them requires substantial computing resources, such as multiple GPUs, and incurs long inference latency.
2. **Environmental impact**: Training and running LLMs consumes a large amount of energy, which has raised significant environmental concerns.
3. **Model efficiency**: To address these problems, researchers have been looking for ways to make LLMs more efficient, using techniques such as model distillation, quantization, and pruning. Model pruning in particular reduces the size of a model by removing some of its components while preserving as much of its performance as possible.

In response, the paper proposes **FinerCut**, a new fine-grained layer-pruning method. Unlike previous work, which mainly prunes at the transformer block level, FinerCut treats every self-attention layer and every feed-forward network (FFN) layer within a block as an independent pruning candidate. Specifically, FinerCut prunes the layers whose removal changes the model output the least, yielding a new, lean, interpretable, and task-agnostic pruning method.

The main contributions of the paper include:

- Proposing a new layer-pruning method, FinerCut, which handles self-attention layers and FFN layers in a fine-grained manner.
- Introducing a new pruning formulation that aims to minimize the impact of pruning on the model output, measured as the change in the prediction distribution.
- Showing experimentally that FinerCut can significantly reduce the computational cost of a model while maintaining high performance, without fine-tuning or reconstruction after pruning.
- Providing, through analysis of the pruned layers, a tool for studying the mechanistic interpretability of LLMs. The analysis finds that deep self-attention layers are more redundant than FFN layers, which suggests new directions for designing more efficient LLM architectures.

Overall, FinerCut not only helps to address the compute and environmental costs of LLMs but also provides insight into the internal mechanisms of these models.
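The selection rule described above (remove the layer whose removal changes the output distribution the least) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy residual model, the names `make_toy_model`, `forward`, `mean_kl`, and `greedy_prune`, the greedy one-layer-at-a-time search, and the use of KL divergence as the distance between prediction distributions are all assumptions made for illustration; the text above only states that the change in the prediction distribution is measured.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_toy_model(n_blocks, d_model, vocab, rng):
    # Hypothetical stand-in for a decoder: every block contributes one
    # "attn" and one "ffn" residual sublayer, each a pruning candidate.
    layers = []
    for b in range(n_blocks):
        for kind in ("attn", "ffn"):
            scale = 0.5 / (1 + b)  # assumption: deeper layers contribute less
            W = rng.normal(scale=scale / np.sqrt(d_model), size=(d_model, d_model))
            layers.append({"name": f"{kind}_{b}", "W": W})
    head = rng.normal(scale=1.0 / np.sqrt(d_model), size=(d_model, vocab))
    return layers, head

def forward(layers, head, x, removed=frozenset()):
    h = x
    for layer in layers:
        if layer["name"] in removed:
            continue  # with a residual connection, removal reduces to identity
        h = h + np.tanh(h @ layer["W"])
    return softmax(h @ head)

def mean_kl(p, q, eps=1e-12):
    # Mean KL(p || q) over the calibration batch (assumed distance metric).
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def greedy_prune(layers, head, x, n_remove):
    # Repeatedly remove the single layer whose removal keeps the pruned
    # model's predictions closest to those of the original model.
    reference = forward(layers, head, x)
    removed, history = set(), []
    for _ in range(n_remove):
        best_name, best_score = None, float("inf")
        for layer in layers:
            name = layer["name"]
            if name in removed:
                continue
            score = mean_kl(reference, forward(layers, head, x, removed | {name}))
            if score < best_score:
                best_name, best_score = name, score
        removed.add(best_name)
        history.append((best_name, best_score))
    return removed, history

rng = np.random.default_rng(0)
layers, head = make_toy_model(n_blocks=4, d_model=16, vocab=32, rng=rng)
calib = rng.normal(size=(64, 16))  # stand-in for calibration inputs
removed, history = greedy_prune(layers, head, calib, n_remove=2)
for name, score in history:
    print(name, score)
```

Because attention and FFN sublayers are scored individually rather than as whole blocks, the returned `removed` set directly shows which layer types and depths were judged most redundant, which is the kind of inspection the summary describes.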

FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models
