Full Parameter Fine-tuning for Large Language Models with Limited Resources

Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu
2024-06-06
Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLM training would encourage greater participation from researchers, benefiting both academia and society. While existing approaches have focused on parameter-efficient fine-tuning, which tunes or adds a small number of parameters, few have addressed the challenge of tuning the full parameters of LLMs with limited resources. In this work, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage. By integrating LOMO with existing memory-saving techniques, we reduce memory usage to 10.8% of that of the standard approach (DeepSpeed solution). Consequently, our approach enables the full-parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090 GPUs, each with 24GB of memory. Code and data are available at https://github.com/OpenLMLab/LOMO.
Computation and Language
What problem does this paper attempt to address?
This paper addresses full-parameter fine-tuning of large language models (LLMs) under limited resources. Existing methods mainly pursue parameter-efficient fine-tuning, i.e., tuning or adding only a small number of parameters, and few address the challenge of tuning all parameters on limited hardware. The researchers propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses gradient computation and the parameter update into one step to reduce memory usage. Combined with existing memory-saving techniques, LOMO reduces memory consumption to 10.8% of that of the standard approach (the DeepSpeed solution). This enables full-parameter fine-tuning of a 65-billion-parameter model on a single machine with 8 RTX 3090 GPUs (24GB of memory each).

The paper analyzes four components of memory usage during LLM training (activations, optimizer states, gradient tensors, and parameters) and optimizes training in three ways. First, the researchers find that SGD (Stochastic Gradient Descent) is a suitable optimizer for full-parameter fine-tuning of LLMs; because it stores no intermediate states, it eliminates the memory required for optimizer states. Second, LOMO reduces the memory used by gradient tensors to a constant, equal to the memory of the largest single gradient tensor. Third, to stabilize mixed-precision training, they integrate gradient normalization and loss scaling and convert certain computations to full precision.

Experimental results show that LOMO significantly reduces memory usage, making the memory needed for full-parameter fine-tuning comparable to that of inference. Experiments on the SuperGLUE benchmark validate the efficiency and effectiveness of LOMO in optimizing LLMs with billions of parameters.

In summary, the main contributions of the paper are:
1. A theoretical analysis indicating that SGD can be successfully applied to full-parameter fine-tuning of LLMs, addressing the obstacles that previously limited its widespread use.
2. LOMO, a new optimizer that greatly reduces GPU memory usage without compromising the fine-tuning process.
3. An experimental evaluation that demonstrates the effectiveness of LOMO in optimizing LLMs under resource-constrained scenarios and supports its performance on downstream tasks.
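
The idea of fusing gradient computation with the parameter update can be illustrated with a toy sketch. The code below is not the LOMO implementation from the linked repository; it is a minimal pure-Python illustration, assuming a chain of bias-free linear layers, a squared-error loss, and plain SGD. All function names (`backward_standard`, `backward_fused`, `run`, and the helpers) are hypothetical. The baseline keeps every layer's gradient alive until the end of the backward pass, while the fused variant applies each layer's update as soon as its gradient exists and then discards it, so only one gradient tensor is held at a time.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matvec_t(W, g):
    """Multiply the transpose of W by vector g."""
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]

def outer(g, x):
    """Outer product: the gradient of a linear layer's weight matrix."""
    return [[gi * xj for xj in x] for gi in g]

def sgd_step(W, G, lr):
    """In-place plain SGD update: W <- W - lr * G (no optimizer state)."""
    for row_w, row_g in zip(W, G):
        for j, g in enumerate(row_g):
            row_w[j] -= lr * g

def forward(weights, x):
    """Run the layer chain, keeping each layer's input for the backward pass."""
    inputs = []
    for W in weights:
        inputs.append(x)
        x = matvec(W, x)
    return x, inputs

def backward_standard(weights, inputs, grad_out, lr):
    """Baseline: store all gradients, then update every layer afterwards."""
    grads = []
    for W, x in zip(reversed(weights), reversed(inputs)):
        grads.append(outer(grad_out, x))
        grad_out = matvec_t(W, grad_out)
    peak = sum(len(G) * len(G[0]) for G in grads)  # all gradients live at once
    for W, G in zip(reversed(weights), grads):
        sgd_step(W, G, lr)
    return peak

def backward_fused(weights, inputs, grad_out, lr):
    """Fused: update each layer right after its gradient is computed."""
    peak = 0
    for W, x in zip(reversed(weights), reversed(inputs)):
        grad_in = matvec_t(W, grad_out)  # uses W before it is updated
        G = outer(grad_out, x)
        peak = max(peak, len(G) * len(G[0]))  # only one gradient is live
        sgd_step(W, G, lr)
        del G  # the gradient is no longer needed
        grad_out = grad_in
    return peak

def make_weights(dims, seed=0):
    """Build random weight matrices for consecutive layer sizes in dims."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(dims, dims[1:])]

def run(backward, dims=(4, 8, 3, 2), lr=0.1):
    """One training step; returns updated weights and peak gradient elements."""
    weights = make_weights(dims)
    x, target = [1.0, -2.0, 0.5, 3.0], [0.0, 1.0]
    y, inputs = forward(weights, x)
    grad_out = [yi - ti for yi, ti in zip(y, target)]  # d/dy of 0.5*||y - target||^2
    peak = backward(weights, inputs, grad_out, lr)
    return weights, peak

if __name__ == "__main__":
    _, peak_std = run(backward_standard)
    _, peak_fused = run(backward_fused)
    print("peak gradient elements (standard):", peak_std)
    print("peak gradient elements (fused):", peak_fused)
```

In this sketch both variants produce the same updated weights, because the gradient passed to the previous layer is computed before the current layer's weights change. The difference lies only in how many gradient elements are alive at once: the sum over all layers for the baseline versus the size of the largest layer for the fused variant. Techniques that need all gradients before updating, such as clipping by the global gradient norm, do not fit this simple fused loop without extra work.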