SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism

Priyansh Bhatnagar, Linfeng Wen, Mingu Kang
2024-11-16
Abstract: Extensive efforts have been made to boost the performance in the domain of language models by introducing various attention-based transformers. However, the inclusion of linear layers with large dimensions contributes to significant computational and memory overheads. The escalating computational demands of these models necessitate the development of various compression techniques to ensure their deployment on devices, particularly in resource-constrained environments. In this paper, we propose a novel compression methodology that dynamically determines the rank of each layer using a soft thresholding mechanism, which clips the singular values with a small magnitude in a differentiable form. This approach automates the decision-making process to identify the optimal degree of compression for each layer. We have successfully applied the proposed technique to attention-based architectures, including BERT for discriminative tasks and GPT2 and TinyLlama for generative tasks. Additionally, we have validated our method on Mamba, a recently proposed state-space model. Our experiments demonstrate that the proposed technique achieves a speed-up of 1.33X to 1.72X in the encoder/decoder with a 50% reduction in total parameters.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is the efficient deployment of existing language models (LMs) on resource-constrained devices. As language models grow in scale and complexity, their demands for compute and memory grow substantially, which makes them difficult to deploy in resource-constrained environments such as edge devices. To address this challenge, the authors propose a new compression method that reduces the number of model parameters through adaptive low-rank approximation, thereby improving inference efficiency and lowering compute and memory costs.

### Main Research Questions

1. **Efficient Inference**: How to maintain model performance while reducing the number of parameters, so that inference is efficient in resource-constrained environments.
2. **Dynamic Adaptive Rank Decision**: How to automatically determine the rank of the low-rank approximation for each layer, according to that layer's contribution, to reach the best balance between compression and performance.
3. **Learnable Singular Value Threshold**: How to introduce a learnable threshold parameter that adjusts the truncation of singular values during training, maximizing compression while minimizing performance loss.

### Solution Overview

To solve these problems, the paper proposes SoftLM, a language model compression method based on a soft-thresholding mechanism. The method has three components:

- **Low-Rank Decomposition**: Use singular value decomposition (SVD) to decompose the weight matrix of each linear layer into three factors \( U \), \( \Sigma \), and \( V \).
- **Soft-Thresholding Mechanism**: Introduce a learnable threshold parameter \( \alpha \) that prunes small-magnitude singular values. The soft-thresholding function \( Th_s(x) \) keeps this pruning differentiable, so gradients propagate smoothly.
- **Adaptive Loss Function**: Define a total loss \( L_{tot} = L_{acc} + \gamma \cdot L_{cmp} \) that combines an accuracy term and a compression term, with \( \gamma \) weighting the trade-off between performance and compression.

With this approach, SoftLM maintains high performance after compression while reducing inference latency and memory usage, which enables efficient model deployment.

### Experimental Verification

The paper reports experiments on several language models (BERT, GPT2, Mamba, and TinyLlama) to verify the effectiveness of the proposed method. The results show that SoftLM maintains performance similar to the original model with a 50% reduction in parameters, and achieves a speed-up of 1.33X to 1.72X in the encoder/decoder.

In conclusion, the paper proposes a language model compression method that targets the efficient deployment of large-scale language models on resource-constrained devices.
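
To make the three components under Solution Overview concrete, here is a minimal NumPy sketch of the idea, not the authors' implementation. It assumes the standard soft-thresholding operator \( \mathrm{sign}(x)\cdot\max(|x|-\alpha, 0) \) for \( Th_s(x) \); the paper's exact form may differ. It also uses a fixed \( \alpha \) rather than a learned one, and a parameter ratio as a stand-in for \( L_{cmp} \). The function names `soft_threshold`, `compress_linear`, and `total_loss` are hypothetical. The example relies on the fact that a rank-\( r \) factorization of an \( m \times n \) layer stores \( r(m+n) \) parameters instead of \( mn \).

```python
import numpy as np

def soft_threshold(x, alpha):
    """Assumed form of Th_s: shrink magnitudes by alpha, zero anything below alpha."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def compress_linear(weight, alpha):
    """Factor an (out, in) weight with SVD and drop singular values thresholded to zero."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    s_thr = soft_threshold(s, alpha)
    keep = s_thr > 0                      # surviving singular values define the rank
    a = u[:, keep] * s_thr[keep]          # (out, r), singular values folded into U
    b = vt[keep, :]                       # (r, in)
    return a, b

def total_loss(l_acc, l_cmp, gamma):
    """L_tot = L_acc + gamma * L_cmp, as stated in the summary."""
    return l_acc + gamma * l_cmp

# Build a 16x12 weight with known singular values so the outcome is easy to check.
rng = np.random.default_rng(0)
q_out, _ = np.linalg.qr(rng.standard_normal((16, 6)))   # orthonormal columns
q_in, _ = np.linalg.qr(rng.standard_normal((12, 6)))
singular_values = np.array([5.0, 3.0, 1.0, 0.2, 0.1, 0.05])
weight = (q_out * singular_values) @ q_in.T

alpha = 0.5                               # fixed here; learnable in the paper
a, b = compress_linear(weight, alpha)
rank = a.shape[1]

original_params = weight.size
compressed_params = a.size + b.size       # r * (out + in)
param_ratio = compressed_params / original_params

# Placeholder values: l_acc would come from the task loss, and the parameter
# ratio is only an assumed stand-in for the paper's compression term.
loss = total_loss(l_acc=1.0, l_cmp=param_ratio, gamma=0.1)

print(f"rank={rank}, params {original_params} -> {compressed_params}")
```

At inference the layer would compute `x @ b.T @ a.T` in place of `x @ weight.T`, which is where the parameter and latency savings come from. Note that soft-thresholding also shrinks the surviving singular values by `alpha`, so the product `a @ b` is not the same as a hard rank-`r` truncation of `weight`.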