Abstract: In recent years, large language models (LLMs) have driven advances in natural language processing. Still, their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, reduces non-essential parameters by decomposing weight matrices into products of two low-rank matrices. Yet, its application in LLMs has not been extensively studied. The key to low-rank compression lies in low-rank factorization and the allocation of low-rank dimensions. To address the challenges of low-rank compression in LLMs, we conduct empirical research on the low-rank characteristics of large models. We propose a low-rank compression method suitable for LLMs. This approach involves precise estimation of feature distributions through pooled covariance matrices and a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on the LLaMA-2 models demonstrate that our method outperforms existing strong structured pruning and low-rank compression techniques in maintaining model performance at the same compression ratio.
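To make the abstract's "products of two low-rank matrices" concrete, the following minimal sketch counts parameters before and after replacing a dense `d_out × d_in` weight with factors of shape `d_out × r` and `r × d_in`. The function names are hypothetical, and the sketch assumes that "compression ratio" means the fraction of parameters removed; the paper may define it differently.

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Parameter count of a dense d_out x d_in weight matrix."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameter count of the factor pair (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


def max_rank_for_ratio(d_out: int, d_in: int, compression_ratio: float) -> int:
    """Largest rank whose factor pair fits the remaining parameter budget.

    Assumption: compression_ratio is the fraction of parameters removed,
    so the budget is (1 - compression_ratio) * d_out * d_in.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    budget = (1.0 - compression_ratio) * dense_params(d_out, d_in)
    return int(budget // (d_out + d_in))
```

For a square 4096 × 4096 matrix, the factor pair only saves parameters when the rank is below 2048, since `r * (4096 + 4096)` equals the dense count at `r = 2048`. This is one reason the choice of rank per layer matters so much for the final compression ratio.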

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the challenges that large language models (LLMs) face under low-rank compression (LRC). Specifically, it focuses on the following issues:

1. **High cost of computational resources**: As LLMs grow in scale, their demand for computational resources rises significantly. Resource consumption therefore has to be reduced while model performance is maintained.
2. **Effectiveness of low-rank compression**: Existing low-rank compression methods have not been fully applied to LLMs, and research on how to allocate low-rank dimensions effectively is especially lacking.
3. **Complexity of the feature space**: The feature space of LLMs is high-dimensional, which makes estimating the feature distribution difficult and exposes the estimate to interference from outliers.
4. **Differences in layer sensitivity**: Different model layers differ in their sensitivity to low-rank compression, so low-rank dimensions must be allocated across layers in a way that minimizes performance loss.

### Solutions

To solve the above problems, the paper proposes the following methods:

1. **Feature-based low-rank decomposition**:
   - Use the pooled covariance matrix (PCM) to estimate the feature distribution more accurately, thereby overcoming the statistical challenges posed by the high-dimensional feature space.
   - Find the optimal low-rank matrices through principal component analysis (PCA) to achieve an efficient feature-based low-rank decomposition.
2. **Bayesian-optimized low-rank dimension allocation**:
   - Use Bayesian optimization (BO) to determine the low-rank dimensions of different layers so as to achieve the best performance at the target compression rate.
   - Quantify the difference between the prediction distributions of the compressed model and the original model with the reverse KL divergence (RKL), and use it to guide the allocation of low-rank dimensions.
3. **Post-training**:
   - After low-rank compression, post-train with a small number of parameters and a small amount of data to further narrow the performance gap between the compressed model and the original model.

### Experimental results

The experimental results show that the method performs well on multiple benchmarks:

- **Language modeling ability**: At a compression rate of 20%, Bolaco (5×4) achieves the best language modeling performance on the 7B model. On the 13B model, Bolaco (5×4) still leads the other compression techniques, but a gap to FLAP remains.
- **Zero-shot task performance**: On zero-shot tasks, the Bolaco method significantly outperforms all baseline methods, with an average performance improvement of 1.5-2%. After post-training, the compressed model moves closer to the original model, retaining 96%-98% of the original performance.

### Conclusion

By proposing a feature-based low-rank compression method and a Bayesian-optimized strategy for allocating low-rank dimensions, the paper addresses the challenges of low-rank compression in LLMs and significantly reduces the demand for computational resources while maintaining performance.
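The reverse KL divergence mentioned in the summary can be illustrated with a small sketch. It assumes the common convention that "reverse" puts the compressed model's distribution `q` first, i.e. `KL(q || p) = Σ q_i (log q_i − log p_i)` with `p` the original model's distribution; the paper's exact definition, and how it aggregates over tokens, are not given in this summary. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(compressed_logits: Sequence[float],
               original_logits: Sequence[float]) -> float:
    """KL(q || p) for one next-token distribution.

    Assumption: q comes from the compressed model and p from the
    original model; the summary does not state the direction.
    """
    if len(compressed_logits) != len(original_logits):
        raise ValueError("logit vectors must have the same length")
    log_q = log_softmax(compressed_logits)
    log_p = log_softmax(original_logits)
    # Weight each log-ratio by q, the compressed model's probability.
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))
```

In a rank-allocation search, a score like this (averaged over calibration tokens) could serve as the objective that the optimizer tries to minimize for each candidate allocation. Because the divergence is asymmetric, swapping the two arguments generally yields a different value.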

Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

Adaptive Feature-based Low-Rank Compression of Large Language Models Via Bayesian Optimization

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

A Survey on Model Compression for Large Language Models

Data-free Weight Compress and Denoise for Large Language Models

Low-Rank Prune-And-Factorize for Language Model Compression

Ranking LLMs by compression

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Aggressive Post-Training Compression on Extremely Large Language Models

LCQ: Low-Rank Codebook based Quantization for Large Language Models

Large Language Models to Enhance Bayesian Optimization

LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression

Search for Efficient Large Language Models

Foundations of Large Language Model Compression -- Part 1: Weight Quantization

Bayesian Low-rank Adaptation for Large Language Models

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Compressing Large Language Models using Low Rank and Low Precision Decomposition

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Convolutional Neural Network Compression Based on Low-Rank Decomposition