Abstract: Extensive efforts have been made to boost the performance in the domain of language models by introducing various attention-based transformers. However, the inclusion of linear layers with large dimensions contributes to significant computational and memory overheads. The escalating computational demands of these models necessitate the development of various compression techniques to ensure their deployment on devices, particularly in resource-constrained environments. In this paper, we propose a novel compression methodology that dynamically determines the rank of each layer using a soft thresholding mechanism, which clips the singular values with a small magnitude in a differentiable form. This approach automates the decision-making process to identify the optimal degree of compression for each layer. We have successfully applied the proposed technique to attention-based architectures, including BERT for discriminative tasks and GPT2 and TinyLlama for generative tasks. Additionally, we have validated our method on Mamba, a recently proposed state-space model. Our experiments demonstrate that the proposed technique achieves a speed-up of 1.33X to 1.72X in the encoder/decoder with a 50% reduction in total parameters.

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is the efficient deployment of existing language models (LMs) on resource-constrained devices. Specifically, as the scale and complexity of language models keep increasing, their demands for computing resources and memory also increase substantially, which makes it difficult to deploy these models in resource-constrained environments such as edge devices. To address this challenge, the author proposes a new compression method, aiming to reduce the number of model parameters through adaptive low-rank approximation, thereby improving inference efficiency and reducing computing and memory costs.

### Main Research Questions

1. **Efficient Inference**: How to maintain model performance while reducing the number of parameters to achieve efficient inference in resource-constrained environments.
2. **Dynamic Adaptive Rank Decision**: How to automatically determine the optimal low-rank approximation according to the specific contribution of each layer to achieve the best compression effect and performance balance.
3. **Learnable Singular Value Threshold**: How to introduce a learnable threshold parameter to dynamically adjust the truncation of singular values during the training process, thereby maximizing the compression effect and minimizing performance loss.

### Solution Overview

To solve the above problems, the paper proposes SoftLM, a language model compression method based on the soft-thresholding mechanism. The core idea of this method is achieved through the following steps:

- **Low-Rank Decomposition**: Use singular value decomposition (SVD) to decompose the weight matrix of the linear layer into three modules \( U \), \( \Sigma \), and \( V \).
- **Soft-Thresholding Mechanism**: Introduce a learnable threshold parameter \( \alpha \), which is used to prune small-magnitude singular values and achieve smooth gradient propagation through the soft-thresholding function \( Th_s(x) \).
- **Adaptive Loss Function**: Design a total loss function \( L_{tot}=L_{acc}+\gamma\cdot L_{cmp} \) that includes accuracy and compressibility to balance performance and compression effects.

Through this method, SoftLM can not only maintain high performance after compression, but also significantly reduce inference latency and memory usage, thereby achieving efficient model deployment.

### Experimental Verification

The paper conducted extensive experiments on multiple language models (such as BERT, GPT2, Mamba, TinyLlama), verifying the effectiveness of the proposed method. The experimental results show that SoftLM can still maintain performance similar to the original model with a 50% reduction in parameters and has a significant improvement in inference speed.

In conclusion, this paper proposes an innovative language model compression method, which solves the problem of efficient deployment of large-scale language models on resource-constrained devices.
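
The soft-thresholding step and the combined loss described in the solution overview can be illustrated with a small sketch. It is not the paper's implementation: it assumes the standard shrinkage form of soft-thresholding, max(σ − α, 0), for non-negative singular values (the exact definition of \( Th_s(x) \) is not given here), it assumes \( \Sigma \) is folded into one of the two factors when counting parameters, and all function names and the sample numbers are hypothetical. In the actual method \( \alpha \) is learned during training; here it is a fixed value.

```python
def soft_threshold(sigma, alpha):
    """Assumed shrinkage form: values at or below alpha become exactly 0."""
    return max(sigma - alpha, 0.0)


def threshold_singular_values(singular_values, alpha):
    """Shrink every singular value and report how many survive (the rank)."""
    shrunk = [soft_threshold(s, alpha) for s in singular_values]
    rank = sum(1 for s in shrunk if s > 0.0)
    return shrunk, rank


def dense_params(m, n):
    """Parameter count of an uncompressed m x n linear layer (bias ignored)."""
    return m * n


def low_rank_params(m, n, r):
    """Parameter count of two factors (m x r) and (r x n), Sigma folded in."""
    return r * (m + n)


def total_loss(l_acc, l_cmp, gamma):
    """L_tot = L_acc + gamma * L_cmp, as stated in the summary."""
    return l_acc + gamma * l_cmp


# Hypothetical singular values of one layer and a fixed threshold.
singular_values = [5.0, 3.0, 1.5, 0.4, 0.1]
alpha = 0.5
shrunk, rank = threshold_singular_values(singular_values, alpha)
print(shrunk, rank)  # [4.5, 2.5, 1.0, 0.0, 0.0] 3

# For a 768 x 768 layer, keeping rank 192 halves the parameter count.
print(dense_params(768, 768), low_rank_params(768, 768, 192))  # 589824 294912
```

The parameter count shows why low rank only pays off below a break-even point: a factorized m x n layer is smaller than the dense one only when r < m·n / (m + n), which is 384 for a 768 x 768 layer.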

SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism

LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

Cross-layer Attention Sharing for Large Language Models

Adaptive Feature-based Low-Rank Compression of Large Language Models Via Bayesian Optimization

Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression

Just CHOP: Embarrassingly Simple LLM Compression

Less is More! A slim architecture for optimal language translation

Hardware-oriented Algorithms for Softmax and Layer Normalization of Large Language Models

LoLCATs: On Low-Rank Linearizing of Large Language Models

Data-free Weight Compress and Denoise for Large Language Models

Streamlining Redundant Layers to Compress Large Language Models

Highly Efficient Neural Network Language Model Compression Using Soft Binarization Training

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

Low-Rank Prune-And-Factorize for Language Model Compression

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention