Abstract: Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last $p\%$ layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30\% of layers with negligible impact on performance and up to 80\% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose $\text{L}^3 \text{Prune}$, a novel layer-pruning strategy based on the model's initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21\% of the parameters with a $-0.3$ performance drop, and the small variant only suffers from a $-5.1$ decrease while pruning 74\% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks, and can be easily pruned.
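
The abstract describes the method as dropping the last $p\%$ of layers before fine-tuning. The following is a minimal sketch of that truncation step on a toy layer list, not the authors' code. The function names, the rounding rule, and the 32-layer toy stack are assumptions made for illustration.

```python
def layers_to_keep(num_layers: int, prune_fraction: float) -> int:
    """Number of leading layers that remain after pruning the last fraction."""
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")
    # Rounding to the nearest layer is an assumption; always keep at least one.
    return max(1, num_layers - round(num_layers * prune_fraction))


def prune_last_layers(layers: list, prune_fraction: float) -> list:
    """Return the layer stack with the last prune_fraction of layers dropped."""
    return layers[: layers_to_keep(len(layers), prune_fraction)]


# Toy stand-in for a 32-layer decoder stack (hypothetical block names).
toy_layers = [f"block_{i}" for i in range(32)]
pruned = prune_last_layers(toy_layers, 0.30)
print(len(pruned), "layers kept; last kept layer:", pruned[-1])
```

With PyTorch, the same slice applied to an `nn.ModuleList` returns a shorter `nn.ModuleList`. In many decoder implementations the stack is exposed as a `layers` attribute with the count stored in the model config, but the attribute names vary by architecture and should be checked before relying on them.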

What problem does this paper attempt to address?

The paper addresses the high inference time and memory cost of using large language models (LLMs) as text embedding models. After supervised contrastive training these models produce strong embeddings, but their size makes them expensive to deploy. The authors propose a simple method that shrinks the model by pruning its last layers while maintaining, or only slightly reducing, embedding quality. They also propose L3Prune, a strategy that uses the model's initial loss to choose which layers to prune, which reduces trial and error and keeps the performance loss small.

The main contributions of the paper are:
- Applying layer pruning to the text embedding setting for the first time, with a simple procedure that fits into any pipeline for converting an LLM into a text encoder.
- Demonstrating that up to 30% of an LLM's layers can be pruned without significantly affecting representation quality, in some cases even improving performance, and that a model pruned by 80% still provides a reasonably effective text embedding model.
- Proposing and evaluating L3Prune, which identifies the layers to prune from the model's initial loss, minimizing the trial and error otherwise needed for effective pruning. L3Prune yields a lightly pruned model (retaining 69% to 89% of the original size, with an average performance loss of -0.2) and a heavily pruned model (retaining 16% to 36% of the original size, with a performance loss of -4.4 to -6.9).

In summary, the core objective of the paper is to substantially reduce the size of LLMs used for text embedding through layer pruning without significantly affecting performance, thereby lowering compute and memory requirements and making the models more practical to deploy.
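
L3Prune is described above as choosing pruning points from the model's initial loss, that is, the loss measured before any fine-tuning. As a rough illustration of what such a measurement could look like, the sketch below computes a per-depth loss profile. It assumes an in-batch InfoNCE loss with temperature 0.05 and uses a hypothetical `toy_embed` function in place of a real pruned encoder. The rule that turns the profile into the two configurations is not given in this summary, so the sketch stops at the profile.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(queries, positives, temperature=0.05):
    """Mean in-batch contrastive loss: query i should match positive i."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def initial_loss_profile(embed, depths, query_texts, positive_texts):
    """Map each candidate depth to the loss measured before fine-tuning."""
    profile = {}
    for depth in depths:
        q = [embed(t, depth) for t in query_texts]
        p = [embed(t, depth) for t in positive_texts]
        profile[depth] = info_nce(q, p)
    return profile


def toy_embed(text, depth):
    """Hypothetical encoder: letter counts plus a depth-dependent shared offset."""
    return [float(text.count("a")), float(text.count("b")), float(depth)]


profile = initial_loss_profile(toy_embed, [1, 4, 8], ["aaa", "bbb"], ["aa", "bb"])
for depth, loss in profile.items():
    print(depth, round(loss, 4))
```

In a real pipeline, `toy_embed` would be replaced by a forward pass through the model truncated to `depth` layers, and the batch would come from the contrastive training data.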

Large Language Models Are Overparameterized Text Encoders
