Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs

Afia Anjum, Maksim E. Eren, Ismael Boureima, Boian Alexandrov, Manish Bhattarai
2024-08-02
Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing (NLP) tasks, such as question-answering, sentiment analysis, text summarization, and machine translation. However, the ever-growing complexity of LLMs demands immense computational resources, hindering the broader research and application of these models. To address this, various parameter-efficient fine-tuning strategies, such as Low-Rank Approximation (LoRA) and Adapters, have been developed. Despite their potential, these methods often face limitations in compressibility. Specifically, LoRA struggles to scale effectively with the increasing number of trainable parameters in modern large-scale LLMs. Additionally, Low-Rank Economic Tensor-Train Adaptation (LoRETTA), which utilizes tensor train decomposition, has not yet achieved the level of compression necessary for fine-tuning very large-scale models with limited resources. This paper introduces Tensor Train Low-Rank Approximation (TT-LoRA), a novel parameter-efficient fine-tuning (PEFT) approach that extends LoRETTA with optimized tensor train (TT) decomposition integration. By eliminating Adapters and traditional LoRA-based structures, TT-LoRA achieves greater model compression without compromising downstream task performance, along with reduced inference latency and computational overhead. We conduct an exhaustive parameter search to establish benchmarks that highlight the trade-off between model compression and performance. Our results demonstrate significant compression of LLMs while maintaining comparable performance to larger models, facilitating their deployment on resource-constrained platforms.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to address the issue of excessive computational resource consumption faced by large language models (LLMs) during fine-tuning. Specifically:

1. **High computational resource demand**: As the scale of LLMs continues to grow, the computational resources required for training also increase dramatically, limiting researchers' ability to apply and study these models.
2. **Limitations of existing methods**: Although there are some parameter-efficient fine-tuning strategies (such as LoRA, Adapters, etc.), these methods still have limitations in terms of compression ratio and performance, especially when dealing with ultra-large-scale models.
3. **Proposing TT-LoRA**: The paper proposes a new parameter-efficient fine-tuning method, Tensor Train Low-Rank Approximation (TT-LoRA). This method significantly reduces the number of parameters that need to be fine-tuned through tensor train decomposition techniques, thereby reducing computational overhead, and performs as well as or even better than other existing methods.

In summary, the goal of the paper is to significantly reduce the computational resource requirements while ensuring model performance through the TT-LoRA method, enabling large-scale language models to be applied on resource-constrained platforms.
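
To give a rough sense of why a tensor train factorization can need far fewer trainable parameters than a two-factor low-rank update, the sketch below counts parameters for a single weight-update matrix under three parameterizations. It is an illustration only: the 768 x 768 layer size, the mode sizes, the LoRA rank of 8, and the uniform TT rank of 5 are assumptions chosen for the arithmetic and are not taken from the paper, and the helper names are hypothetical.

```python
from math import prod


def dense_params(m: int, n: int) -> int:
    """Trainable parameters when the full m x n update is learned directly."""
    return m * n


def lora_params(m: int, n: int, r: int) -> int:
    """Two-factor low-rank update: an (m x r) matrix times an (r x n) matrix."""
    return r * (m + n)


def tt_params(dims: list[int], ranks: list[int]) -> int:
    """Tensor train format: core i has shape (ranks[i], dims[i], ranks[i + 1]).

    The boundary ranks are fixed to 1, so len(ranks) == len(dims) + 1.
    """
    if len(ranks) != len(dims) + 1:
        raise ValueError("need exactly one more rank than there are modes")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(ranks[i] * dims[i] * ranks[i + 1] for i in range(len(dims)))


# Assumed layer size (illustrative, not from the paper).
m, n = 768, 768

# Assumed factorization of the 768 x 768 matrix into a 6-way tensor:
# 768 = 12 * 8 * 8 for the rows and 768 = 8 * 8 * 12 for the columns.
dims = [12, 8, 8, 8, 8, 12]
assert prod(dims) == m * n

# Assumed uniform TT rank of 5 between neighbouring cores.
ranks = [1, 5, 5, 5, 5, 5, 1]

print("dense update :", dense_params(m, n))      # 589824
print("LoRA (r = 8) :", lora_params(m, n, r=8))  # 12288
print("TT (rank 5)  :", tt_params(dims, ranks))  # 920
```

Under these assumed settings the TT cores hold 920 parameters against 12,288 for the two-factor update and 589,824 for the dense matrix. The count says nothing about task accuracy, which the paper evaluates separately through its parameter search.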