Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

Yixin Ji, Yang Xiang, Juntao Li, Wei Chen, Zhongyi Liu, Kehai Chen, Min Zhang
2024-05-17
Abstract: In recent years, large language models (LLMs) have driven advances in natural language processing. Still, their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, reduces non-essential parameters by decomposing weight matrices into products of two low-rank matrices. Yet, its application in LLMs has not been extensively studied. The key to low-rank compression lies in low-rank factorization and low-rank dimensions allocation. To address the challenges of low-rank compression in LLMs, we conduct empirical research on the low-rank characteristics of large models. We propose a low-rank compression method suitable for LLMs. This approach involves precise estimation of feature distributions through pooled covariance matrices and a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on the LLaMA-2 models demonstrate that our method outperforms existing strong structured pruning and low-rank compression techniques in maintaining model performance at the same compression ratio.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the challenges that large language models (LLMs) face under low-rank compression (LRC). It focuses on four issues:

1. **High cost of computational resources**: As LLMs grow, their demand for computational resources rises sharply, so resource consumption must be reduced while model performance is maintained.
2. **Effectiveness of low-rank compression**: Existing low-rank compression methods have not been thoroughly studied for LLMs; in particular, little work addresses how to allocate low-rank dimensions effectively.
3. **Complexity of the feature space**: The feature space of LLMs is high-dimensional, which makes estimating the feature distribution difficult and exposes the estimate to outlier interference.
4. **Differences in layer sensitivity**: Different layers respond differently to low-rank compression, so low-rank dimensions must be allocated across layers in a way that minimizes performance loss.

### Solutions

To address these problems, the paper proposes the following methods:

1. **Feature-based low-rank decomposition**:
   - Use the pooled covariance matrix (PCM) to estimate the feature distribution more accurately, overcoming the statistical difficulties of the high-dimensional feature space.
   - Find the optimal low-rank matrices through principal component analysis (PCA) to obtain an efficient low-rank decomposition of the features.
2. **Bayesian-optimized low-rank dimension allocation**:
   - Use Bayesian optimization (BO) to choose the low-rank dimension of each layer so that performance is as high as possible at the target compression rate.
   - Use the reverse KL divergence (RKL) between the prediction distributions of the compressed and original models as the measure that guides the allocation.
3. **Post-training**:
   - After low-rank compression, post-train with a small number of parameters and a small amount of data to further narrow the performance gap between the compressed and original models.

### Experimental results

The method performs well on multiple benchmarks:

- **Language modeling ability**: At a compression rate of 20%, Bolaco (5×4) achieves the best language modeling performance on the 7B model. On the 13B model it still leads the other compression techniques apart from FLAP, which it trails by a noticeable gap.
- **Zero-shot task performance**: On zero-shot tasks, Bolaco outperforms all baseline methods, with an average improvement of 1.5 to 2%. After post-training, the compressed model moves closer to the original, retaining 96% to 98% of the original performance.

### Conclusion

By combining a feature-based low-rank compression method with a Bayesian-optimized strategy for allocating low-rank dimensions, the paper addresses the main challenges of low-rank compression for LLMs and reduces the demand for computational resources while largely maintaining performance.
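The feature-based decomposition listed under Solutions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `pooled_covariance` and `feature_low_rank` are hypothetical, the pooled estimate uses the textbook degrees-of-freedom weighting over calibration batches, and the feature mean is folded into a bias term. The paper's exact procedure may differ on each of these points.

```python
import numpy as np


def pooled_covariance(batches):
    """Pool per-batch covariances, weighting each by its degrees of freedom.

    Assumption: textbook pooled covariance, sum_i (n_i - 1) S_i / sum_i (n_i - 1).
    """
    dim = batches[0].shape[1]
    scatter = np.zeros((dim, dim))
    dof = 0
    for feats in batches:
        centered = feats - feats.mean(axis=0, keepdims=True)
        scatter += centered.T @ centered  # equals (n_i - 1) * S_i
        dof += feats.shape[0] - 1
    return scatter / dof


def feature_low_rank(weight, input_batches, rank):
    """Factor a linear layer y = W x into y ~ A (B x) + bias.

    weight: (d_out, d_in); each input batch: (n_i, d_in).
    Returns A (d_out, rank), B (rank, d_in), bias (d_out,).
    """
    # Output features of the layer on the calibration batches.
    outputs = [x @ weight.T for x in input_batches]
    cov = pooled_covariance(outputs)

    # eigh returns eigenvalues in ascending order, so reverse the columns
    # and keep the `rank` leading principal directions.
    _, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, ::-1][:, :rank]

    # PCA reconstruction: y ~ U U^T (y - mean) + mean.
    mean = np.concatenate(outputs).mean(axis=0)
    A = top
    B = top.T @ weight
    bias = mean - top @ (top.T @ mean)
    return A, B, bias
```

The two factors store `rank * (d_out + d_in)` weights instead of `d_out * d_in`, so the factorization saves parameters only when `rank` is below `d_out * d_in / (d_out + d_in)`. Choosing that rank per layer is the allocation problem the paper assigns to Bayesian optimization.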