Characterizing the Accuracy-Efficiency Trade-off of Low-rank Decomposition in Language Models

Chakshu Moar, Faraz Tahmasebi, Michael Pellauer, Hyoukjun Kwon
2024-10-23
Abstract: Recent large language models (LLMs) employ billions of parameters to enable broad problem-solving capabilities. Such language models also tend to be memory-bound because of the dominance of matrix-vector and matrix-matrix multiplications with low arithmetic intensity. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored to achieve memory footprint and traffic optimization. However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank decomposition) for LLMs is not well-understood yet. Therefore, in this work, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank decomposition design space and show that the decomposition design space is enormous (e.g., O($2^{39}$) for Llama2-7B). To navigate such a vast design space, we formulate it and perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9% model size reduction with minimal accuracy drops, which range from 4%p (%p refers to "percentage point," which refers to the absolute difference between two percentage numbers; 74% -> 78% = 4%p increase) to 10%p, depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. The results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent and real-time coding assistant), where the latency is as important as the model accuracy.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses the memory footprint and computational efficiency of large language models (LLMs). Because these models have billions of parameters, memory usage and memory traffic become significant bottlenecks. Model compression methods such as quantization and parameter pruning have been explored to reduce them. However, the accuracy-efficiency trade-off of low-rank decomposition (i.e., rank pruning) in LLMs is not yet well understood. The paper therefore characterizes this trade-off by applying a low-rank decomposition method, specifically Tucker decomposition, to recent language models, including the open-source Llama 2. The specific objectives include:

1. **Characterize the trade-off space between accuracy and efficiency**: Analyze BERT and Llama 2 models on six widely used LLM benchmarks (such as AI2 Reasoning Challenge, HellaSwag, and MMLU), measuring the accuracy loss and efficiency gain under different low-rank decomposition configurations.
2. **Formalize the design space of low-rank decomposition**: Define the design choices of low-rank decomposition, including which layers and tensors to decompose and the pruned rank of each decomposed tensor.
3. **Identify effective low-rank decomposition design choices**: Based on the analysis results, identify design choices that balance model accuracy and computational efficiency, providing insights for future low-rank decomposition research.

Through these studies, the paper hopes to provide an effective optimization method for large-scale language model applications that require real-time services (such as virtual agents, real-time coding assistants, etc.), while maintaining high model accuracy.
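
As a rough illustration of the design choices listed above (which layers, which tensors, and which pruned rank), the following sketch counts how many parameters a single configuration would remove. It is not the paper's implementation. The tensor names and shapes are assumptions modeled on a Llama-2-7B-style layer, `DecompositionChoice` and the helper functions are hypothetical, and the storage cost assumes a 2D Tucker form with two factor matrices and a square core.

```python
from dataclasses import dataclass

# Assumed (rows, cols) weight shapes for one Llama-2-7B-style transformer layer.
# These names and sizes are illustrative assumptions, not taken from the paper.
LAYER_TENSOR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (11008, 4096),
    "up_proj": (11008, 4096),
    "down_proj": (4096, 11008),
}


@dataclass(frozen=True)
class DecompositionChoice:
    """One point in the design space (hypothetical helper type)."""
    layers: tuple   # indices of the layers to decompose
    tensors: tuple  # names of the tensors to decompose in each chosen layer
    rank: int       # pruned rank applied to every decomposed tensor


def dense_params(shape):
    """Parameter count of the original, undecomposed weight matrix."""
    rows, cols = shape
    return rows * cols


def tucker2d_params(shape, rank):
    """Parameter count after decomposition, assuming the weight is stored as
    a (rows x rank) factor, a (rank x rank) core, and a (rank x cols) factor."""
    rows, cols = shape
    return rows * rank + rank * rank + rank * cols


def params_saved(choice):
    """Total parameters removed by applying `choice` to the model."""
    saved_per_layer = sum(
        dense_params(LAYER_TENSOR_SHAPES[name])
        - tucker2d_params(LAYER_TENSOR_SHAPES[name], choice.rank)
        for name in choice.tensors
    )
    return saved_per_layer * len(choice.layers)


# Example: decompose the query and key projections of two layers to rank 1.
choice = DecompositionChoice(layers=(30, 31), tensors=("q_proj", "k_proj"), rank=1)
print(params_saved(choice))
```

Dividing the result by the model's total parameter count gives the size reduction for that configuration. The accuracy side of the trade-off cannot be computed this way and has to be measured on benchmarks.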