Abstract: Finetuned large language models (LLMs) have shown remarkable performance in financial tasks, such as sentiment analysis and information retrieval. Due to privacy concerns, finetuning and deploying Financial LLMs (FinLLMs) locally are crucial for institutions. However, finetuning FinLLMs poses challenges including GPU memory constraints and long input sequences. In this paper, we employ quantized low-rank adaptation (QLoRA) to finetune FinLLMs, which leverages low-rank matrix decomposition and quantization techniques to significantly reduce computational requirements while maintaining high model performance. We also employ data and pipeline parallelism to enable local finetuning using cost-effective, widely accessible GPUs. Experiments on financial datasets demonstrate that our method achieves substantial improvements in accuracy, GPU memory usage, and time efficiency, underscoring the potential of low-rank methods for scalable and resource-efficient LLM finetuning.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is how to efficiently fine-tune and deploy financial large language models (FinLLMs) in a resource-constrained local environment while meeting the complex requirements of financial tasks. Specifically, the paper focuses on the following aspects:

1. **GPU Memory Limitations**: Large language models usually require a large amount of GPU memory for fine-tuning and inference, which is a major challenge on resource-constrained local devices.
2. **Long Input Sequences**: Financial task data usually contains long text sequences, which place higher demands on the model's computational resources.
3. **Privacy and Regulatory Constraints**: Because financial data is sensitive, financial institutions need to perform model fine-tuning and inference in a local environment to ensure data security and compliance.

To solve these problems, the paper proposes a method based on Quantized Low-Rank Adaptation (QLoRA), combined with Distributed Data Parallel (DDP) and pipeline parallel techniques, to significantly reduce the demand for computational resources while maintaining high model performance.

### Main Contributions

1. **Application of QLoRA**: Low-rank matrix decomposition reduces the number of trainable parameters required for fine-tuning, and quantization compresses the model size, thereby reducing GPU memory consumption.
2. **Optimization of Distributed Training**: DDP and pipeline parallel techniques use multiple GPUs effectively for training and inference, improving training speed and efficiency.
3. **Experimental Verification**: Experiments on multiple financial datasets show that models fine-tuned with QLoRA achieve significant improvements in accuracy, GPU memory usage, and time efficiency, demonstrating the effectiveness of low-rank adaptation and quantization in addressing the unique challenges of FinLLMs.

Through these methods, the paper demonstrates that efficient, low-cost fine-tuning and deployment of FinLLMs is possible using widely available GPUs in a resource-constrained environment.
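
To make the two ideas behind QLoRA named above more concrete, here is a minimal Python sketch, not the paper's implementation. It assumes a hypothetical 4096 x 4096 layer and rank 8, and it uses uniform symmetric 4-bit levels as a simplification (QLoRA itself uses the 4-bit NormalFloat data type). The function names `lora_trainable_params`, `quantize_absmax_4bit`, and `dequantize` are illustrative only and do not come from the paper or any library.

```python
# Illustrative sketch only; all names and sizes here are assumptions.

def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters when the update to a d_out x d_in weight is
    factored as B @ A, with B of shape (d_out, rank) and A of shape
    (rank, d_in). The frozen base weight is not counted."""
    return rank * (d_out + d_in)


def quantize_absmax_4bit(block):
    """Uniform symmetric 4-bit quantization of one block of floats.

    Returns integer codes in [-7, 7] plus one floating-point scale
    for the whole block (a simplification of QLoRA's NF4 scheme).
    """
    absmax = max(abs(x) for x in block)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 8  # hypothetical layer shape and rank
    full = d_out * d_in
    lora = lora_trainable_params(d_out, d_in, rank)
    print(f"full update: {full} params; rank-{rank} update: {lora} params")

    weights = [0.7, -0.2, 0.05, 0.0]  # toy frozen weights
    codes, scale = quantize_absmax_4bit(weights)
    approx = dequantize(codes, scale)
    worst = max(abs(w - a) for w, a in zip(weights, approx))
    print(f"codes={codes}, worst reconstruction error={worst:.4f}")
```

Under these assumptions, the rank-8 update trains 65,536 parameters instead of the 16,777,216 in the full matrix, while the frozen base weights are stored as 4-bit codes plus one scale per block. The reconstruction error of each weight is bounded by half the block's scale, which is why small block sizes are preferred in practice.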

FinLoRA: Finetuning Quantized Financial Large Language Models Using Low-Rank Adaptation

QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance

OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models

Learning on LoRAs: GL-Equivariant Processing of Low-Rank Weight Spaces for Large Finetuned Models

Low-Rank Quantization-Aware Training for LLMs

IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models

LoRA ensembles for large language model fine-tuning

QLoRA: Efficient Finetuning of Quantized LLMs

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

LoRA Learns Less and Forgets Less