FinLoRA: Finetuning Quantized Financial Large Language Models Using Low-Rank Adaptation

Dannong Wang, Daniel Kim, Bo Jin, Xingjian Zhao, Tianfan Fu, Steve Yang, Xiao-Yang Liu
2024-12-16
Abstract: Finetuned large language models (LLMs) have shown remarkable performance in financial tasks, such as sentiment analysis and information retrieval. Due to privacy concerns, finetuning and deploying Financial LLMs (FinLLMs) locally are crucial for institutions. However, finetuning FinLLMs poses challenges including GPU memory constraints and long input sequences. In this paper, we employ quantized low-rank adaptation (QLoRA) to finetune FinLLMs, which leverages low-rank matrix decomposition and quantization techniques to significantly reduce computational requirements while maintaining high model performance. We also employ data and pipeline parallelism to enable local finetuning using cost-effective, widely accessible GPUs. Experiments on financial datasets demonstrate that our method achieves substantial improvements in accuracy, GPU memory usage, and time efficiency, underscoring the potential of low-rank methods for scalable and resource-efficient LLM finetuning.
Machine Learning
What problem does this paper attempt to address?
The paper asks how to fine-tune and deploy financial large language models (FinLLMs) efficiently in a resource-constrained local environment while still meeting the demands of financial tasks. It focuses on three challenges:

1. **GPU Memory Limitations**: Fine-tuning and inference with large language models usually require a large amount of GPU memory, which is a major obstacle on resource-constrained local hardware.
2. **Long Input Sequences**: Financial data often contains long text sequences, which further increase the computational resources required.
3. **Privacy and Regulatory Constraints**: Because financial data is sensitive, institutions need to fine-tune and run models locally to ensure data security and compliance.

To address these challenges, the paper uses Quantized Low-Rank Adaptation (QLoRA) combined with Distributed Data Parallel (DDP) and pipeline parallelism, which significantly reduces the demand for computational resources while maintaining high model performance.

### Main Contributions

1. **Application of QLoRA**: Low-rank matrix decomposition reduces the number of trainable parameters needed for fine-tuning, and quantization compresses the model; together they reduce GPU memory consumption.
2. **Optimization of Distributed Training**: DDP and pipeline parallelism make effective use of multiple GPUs for training and inference, improving training speed and efficiency.
3. **Experimental Verification**: Experiments on multiple financial datasets show that models fine-tuned with QLoRA achieve significant improvements in accuracy, GPU memory usage, and time efficiency, demonstrating that low-rank adaptation and quantization are effective for the challenges specific to FinLLMs.

Together, these methods show that FinLLMs can be fine-tuned and deployed efficiently and at low cost on widely available GPUs in a resource-constrained environment.
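
The memory savings attributed to QLoRA rest on two simple pieces of arithmetic: a rank-r adapter for a d_out × d_in weight matrix trains only r · (d_in + d_out) parameters instead of d_out · d_in, and storing frozen weights at fewer bits per parameter shrinks them proportionally. The following is a minimal Python sketch of that arithmetic, not the authors' code. The layer size (4096 × 4096), rank (8), and 8-billion-parameter model are hypothetical values chosen for illustration, and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix W.
    A low-rank adapter instead trains B (d_out x rank) and A (rank x d_in),
    so that the update is the product B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


def weight_memory_gib(n_params: int, bits_per_param: int) -> float:
    """Approximate storage for n_params weights, in GiB (weights only)."""
    return n_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical layer shape and adapter rank, for illustration only.
    full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
    print(f"full fine-tuning: {full:,} trainable parameters")
    print(f"rank-8 adapter:   {lora:,} trainable parameters")
    print(f"ratio:            {lora / full:.4%}")

    # Hypothetical 8-billion-parameter base model at two storage precisions.
    n_params = 8_000_000_000
    for bits in (16, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gib(n_params, bits):.2f} GiB")
```

Under these assumptions the adapter trains well under 1% of the parameters of the layer it modifies, and 4-bit storage needs a quarter of the weight memory of 16-bit storage. Those two effects are what make fine-tuning on widely available GPUs plausible.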