Abstract: Recent large language models (LLMs) employ billions of parameters to enable broad problem-solving capabilities. Such language models also tend to be memory-bound because they are dominated by matrix-vector and matrix-matrix multiplications with low arithmetic intensity. Reducing the memory footprint and memory traffic is therefore an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored for this purpose. However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank decomposition) for LLMs is not yet well understood. In this work, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank decomposition design space and show that it is enormous (e.g., O($2^{39}$) for Llama2-7B). To navigate such a vast design space, we perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9\% model size reduction with minimal accuracy drops, which range from 4\%p to 10\%p depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. (\%p denotes "percentage point," the absolute difference between two percentage numbers; for example, 74\% -> 78\% is a 4\%p increase.) The results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agents and real-time coding assistants), where latency is as important as model accuracy.
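
To make rank pruning concrete, the following is a minimal sketch, not the paper's implementation. For a two-dimensional weight matrix, a Tucker-style factorization with equal ranks has the form W ≈ U1 · G · U2ᵀ, and one standard way to obtain such factors is a truncated SVD. The matrix size (512 x 512), the rank (64), the random weights, and the helper names `low_rank_factors` and `param_count` are all assumptions chosen for illustration.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Return (u1, core, u2) such that weight ≈ u1 @ core @ u2.T.

    Uses a truncated SVD, which for a matrix yields a Tucker-style
    factorization with a diagonal (rank x rank) core.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    u1 = u[:, :rank]          # (rows, rank) left factor
    core = np.diag(s[:rank])  # (rank, rank) core holding the top singular values
    u2 = vt[:rank, :].T       # (cols, rank) right factor
    return u1, core, u2


def param_count(*arrays):
    """Total number of stored scalars across the given arrays."""
    return sum(a.size for a in arrays)


# Hypothetical stand-in for one weight tensor of a model (random, not real weights).
rng = np.random.default_rng(0)
weight = rng.standard_normal((512, 512))
rank = 64

u1, core, u2 = low_rank_factors(weight, rank)
approx = u1 @ core @ u2.T

original_params = weight.size
factored_params = param_count(u1, core, u2)
rel_error = np.linalg.norm(weight - approx) / np.linalg.norm(weight)

print(f"parameters: {original_params} -> {factored_params}")
print(f"relative reconstruction error: {rel_error:.3f}")
```

A random Gaussian matrix has no low-rank structure, so the reconstruction error here is large. The sketch only shows the mechanics: a lower rank stores fewer parameters at the cost of a coarser approximation, which is the trade-off the paper measures on real model weights.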

What problem does this paper attempt to address?

The paper addresses the memory footprint and computational cost of large language models (LLMs). Because of their very large parameter counts, LLMs face significant bottlenecks in memory usage and computational resources. Researchers have explored model compression methods such as quantization and parameter pruning to relieve these bottlenecks, but the accuracy-efficiency trade-off of low-rank decomposition (i.e., rank pruning) in LLMs is not yet well understood. The paper therefore characterizes this trade-off by applying a low-rank decomposition method, specifically Tucker decomposition, to modern language models, including the open-source Llama 2 model. The specific objectives are:

1. **Characterize the trade-off space between accuracy and efficiency**: Analyze BERT and Llama 2 models on six widely used LLM benchmarks (such as AI2 Reasoning Challenge, HellaSwag, and MMLU), measuring the accuracy loss and efficiency improvement under different low-rank decomposition configurations.
2. **Formalize the design space of low-rank decomposition**: Define and describe the design choices of low-rank decomposition, including which layers and tensors to decompose and the pruned rank of each decomposed tensor.
3. **Identify effective low-rank decomposition design choices**: Based on the analysis results, identify design choices that effectively balance model accuracy and computational efficiency, providing insights for future low-rank decomposition research.

Through these studies, the paper aims to provide an effective optimization approach for LLM applications that require real-time service (such as virtual agents and real-time coding assistants) while maintaining high model accuracy.
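
The design choices named in the objectives (which layers, which tensors, and what pruned rank) can be sketched as a small configuration type. The counting below is only one way to reach a figure of order $2^{39}$: it assumes a single subset of layers and a single subset of tensor kinds is chosen, and the paper's own derivation may count differently. Llama 2 7B has 32 transformer layers with seven linear weight tensors each. The tensor names follow the Hugging Face naming convention, and `DecompositionConfig` and `count_layer_tensor_subsets` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Assumptions for illustration: 32 transformer layers (Llama 2 7B) and
# 7 linear weight tensors per layer (Hugging Face naming convention).
NUM_LAYERS = 32
TENSOR_KINDS = ("q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj")


@dataclass(frozen=True)
class DecompositionConfig:
    """One point in the design space (hypothetical structure)."""
    layers: frozenset   # indices of the layers to decompose
    tensors: frozenset  # tensor kinds to decompose within those layers
    rank: int           # pruned rank applied to every selected tensor

    def targets(self):
        """List every (layer, tensor) pair this configuration decomposes."""
        return sorted((layer, name) for layer in self.layers for name in self.tensors)


def count_layer_tensor_subsets(num_layers, num_tensor_kinds):
    """Count choices of (subset of layers) x (subset of tensor kinds)."""
    return 2 ** num_layers * 2 ** num_tensor_kinds


num_choices = count_layer_tensor_subsets(NUM_LAYERS, len(TENSOR_KINDS))
example = DecompositionConfig(layers=frozenset({30, 31}),
                              tensors=frozenset({"q_proj", "k_proj"}),
                              rank=1)

print(num_choices == 2 ** 39)  # True under the assumptions above
print(example.targets())
```

Choosing a rank for each selected tensor would multiply this count further, which is why exhaustive search is impractical and the paper relies on case studies instead.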

Characterizing the Accuracy-Efficiency Trade-off of Low-rank Decomposition in Language Models

LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression

Low-Rank Prune-And-Factorize for Language Model Compression

Accelerating the Low-Rank Decomposed Models

Data-free Weight Compress and Denoise for Large Language Models

Lillama: Large Language Models Compression via Low-Rank Feature Distillation

QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

Pruning Large Language Models via Accuracy Predictor

Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models

Compressing Large Language Models using Low Rank and Low Precision Decomposition

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Adaptive Feature-based Low-Rank Compression of Large Language Models Via Bayesian Optimization

The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models

SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model