Large Language Models Are Overparameterized Text Encoders

Thennal D K, Tim Fischer, Chris Biemann
2024-10-19
Abstract: Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last $p\%$ layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30% of layers with negligible impact on performance and up to 80% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose $\text{L}^3 \text{Prune}$, a novel layer-pruning strategy based on the model's initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21% of the parameters with a $-0.3$ performance drop, and the small variant only suffers from a $-5.1$ decrease while pruning 74% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks, and can be easily pruned.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the high inference time and memory requirements that large language models (LLMs) incur when they are used for text embedding. These models become strong text embedding models after supervised contrastive training, but their size makes them expensive to run in practice. The authors therefore propose a simple method that shrinks the model by pruning its last layers while maintaining, or only slightly reducing, the quality of the embeddings. They also propose L3Prune, a strategy that uses the model's initial loss to decide which layers to prune, so that a good configuration can be found with little trial and error.
The main contributions of the paper include:
- Applying layer pruning to the text embedding setting for the first time, with a simple procedure that fits into any pipeline for converting an LLM into a text encoder.
- Demonstrating that up to 30% of an LLM's layers can be pruned without significantly affecting representation quality, in some cases even improving performance; even at 80% pruning, the result is still a reasonably effective text embedding model.
- Proposing and evaluating L3Prune, which identifies specific pruning layers from the model's initial loss and thereby minimizes the trial and error required for effective pruning. L3Prune yields a lightly pruned model (retaining 69% to 89% of the original size, with an average performance change of -0.2) and a heavily pruned model (retaining 16% to 36% of the original size, with a performance change of -4.4 to -6.9).
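To make the idea of pruning the last layers concrete, here is a minimal sketch on a toy stand-in for a model. It is an illustration only, not the authors' code: `ToyConfig`, `ToyModel`, and `prune_last_layers` are hypothetical names, the `num_hidden_layers` field merely mirrors a naming convention common in model configs, and the way the layer count is rounded is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConfig:
    # Hypothetical stand-in for a model config.
    num_hidden_layers: int


@dataclass
class ToyModel:
    # Hypothetical stand-in for an LLM: an ordered list of transformer blocks.
    config: ToyConfig
    layers: list = field(default_factory=list)


def prune_last_layers(model: ToyModel, p: float) -> ToyModel:
    """Drop the last fraction `p` of layers in place (0 <= p < 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    n_total = len(model.layers)
    # Assumption: round the number of dropped layers, and always keep at least one.
    n_keep = max(1, n_total - round(n_total * p))
    model.layers = model.layers[:n_keep]       # keep only the first n_keep blocks
    model.config.num_hidden_layers = n_keep    # keep the config consistent
    return model


model = ToyModel(ToyConfig(32), [f"block_{i}" for i in range(32)])
prune_last_layers(model, 0.25)
print(len(model.layers), model.layers[-1])
```

With a real model, the same slicing would be applied to the container that holds the transformer blocks before supervised contrastive training starts; where that container lives depends on the library and architecture.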
In summary, the core objective of the paper is to substantially reduce the size of LLMs used for text embedding through layer pruning, without significantly affecting performance, thereby lowering computational and memory requirements and making these models more practical to deploy.
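The summary reports pruning both as a share of layers and as a share of parameters or model size, and the two are not the same number: parameters outside the transformer blocks, such as token embeddings, are not removed when layers are dropped. The sketch below works through that arithmetic. All sizes are made-up round numbers chosen for illustration, not figures from the paper, and `retained_fraction` is a hypothetical helper.

```python
def retained_fraction(n_layers: int, n_keep: int,
                      params_per_layer: int, other_params: int) -> float:
    """Fraction of parameters left after keeping the first `n_keep` layers.

    `other_params` counts everything outside the transformer blocks
    (for example token embeddings), which layer pruning does not remove.
    """
    total = n_layers * params_per_layer + other_params
    kept = n_keep * params_per_layer + other_params
    return kept / total


# Hypothetical model: 32 blocks of 200M parameters each, plus 500M elsewhere.
frac = retained_fraction(n_layers=32, n_keep=24,
                         params_per_layer=200_000_000,
                         other_params=500_000_000)
layers_pruned = 1 - 24 / 32
params_pruned = 1 - frac
print(f"layers pruned: {layers_pruned:.1%}, parameters pruned: {params_pruned:.1%}")
```

In this made-up configuration, removing 25% of the layers removes a little over 23% of the parameters, because the 500M parameters outside the blocks stay in place.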