Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLMs training would encourage greater participation from researchers, benefiting both academia and society. While existing approaches have focused on parameter-efficient fine-tuning, which tunes or adds a small number of parameters, few have addressed the challenge of tuning the full parameters of LLMs with limited resources. In this work, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage. By integrating LOMO with existing memory saving techniques, we reduce memory usage to 10.8% compared to the standard approach (DeepSpeed solution). Consequently, our approach enables the full parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090, each with 24GB memory. Code and data are available at https://github.com/OpenLMLab/LOMO.

What problem does this paper attempt to address?

This paper addresses full-parameter fine-tuning of large language models (LLMs) under limited GPU resources. Existing methods mainly focus on parameter-efficient fine-tuning, which adjusts or adds only a small number of parameters, while the challenge of tuning all parameters with limited resources has received little attention. The researchers propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses gradient computation and the parameter update into one step to reduce memory usage. Combined with existing memory-saving techniques, LOMO reduces memory usage to 10.8% of that of the standard approach (the DeepSpeed solution). This enables full-parameter fine-tuning of a 65-billion-parameter model on a single machine with 8 RTX 3090 GPUs (24GB of memory each).

The paper analyzes four sources of memory usage during LLM training (activations, optimizer states, gradient tensors, and parameters) and optimizes the training process in three ways. First, the researchers find that SGD (Stochastic Gradient Descent) is a suitable optimizer for full-parameter fine-tuning of LLMs; because it stores no intermediate state, it eliminates the memory required for optimizer states. Second, LOMO reduces the memory usage of gradient tensors to a constant level, equal to the memory of the largest single gradient tensor. Third, to stabilize mixed-precision training, they integrate gradient normalization and loss scaling and convert certain computations to full precision.

Experimental results show that LOMO significantly reduces memory usage, making the memory footprint of full-parameter fine-tuning comparable to that of inference. Experiments on the SuperGLUE benchmark further validate the efficiency and effectiveness of LOMO in optimizing LLMs with billions of parameters.

In summary, the main contributions of the paper are:

1. A theoretical analysis showing that SGD can be applied successfully to full-parameter fine-tuning of LLMs, addressing the obstacles that previously limited its widespread use.
2. LOMO, a new optimizer that greatly reduces GPU memory usage without compromising the fine-tuning process.
3. An experimental evaluation that demonstrates the effectiveness of LOMO in optimizing LLMs under resource constraints and supports its performance on downstream tasks.
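The fused gradient-and-update idea described above can be illustrated with a small sketch. This is not the authors' implementation (that lives in the linked repository); it is a plain-Python toy that assumes a bias-free chain of linear layers, a squared-error loss, and plain SGD. All function names (`make_weights`, `standard_sgd_step`, `fused_sgd_step`, and the helpers) are hypothetical, and the count of gradient elements held at once stands in for gradient memory.

```python
import copy
import random


def make_weights(shapes, seed=0):
    """Create one (rows x cols) weight matrix per layer as nested lists."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
            for rows, cols in shapes]


def forward(weights, x):
    """Run the linear chain and remember each layer's input for backward."""
    inputs, h = [], x
    for W in weights:
        inputs.append(h)
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
    return h, inputs


def backward_layer(W, h_in, g_out):
    """Return (dL/dW, dL/dh_in) for one layer, given dL/d(output)."""
    grad_W = [[g_i * h_j for h_j in h_in] for g_i in g_out]
    g_in = [sum(W[i][j] * g_out[i] for i in range(len(W)))
            for j in range(len(h_in))]
    return grad_W, g_in


def sgd_update(W, grad_W, lr):
    """In-place SGD update: W <- W - lr * grad_W."""
    for row, g_row in zip(W, grad_W):
        for j, g_ij in enumerate(g_row):
            row[j] -= lr * g_ij


def standard_sgd_step(weights, x, target, lr):
    """Baseline: keep every gradient tensor, then update all layers."""
    out, inputs = forward(weights, x)
    g = [o - t for o, t in zip(out, target)]  # gradient of 0.5 * ||out - target||^2
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        grads[k], g = backward_layer(weights[k], inputs[k], g)
    held = sum(len(G) * len(G[0]) for G in grads)  # all gradients alive at once
    for W, G in zip(weights, grads):
        sgd_update(W, G, lr)
    return held


def fused_sgd_step(weights, x, target, lr):
    """Fused variant: update each layer as soon as its gradient exists."""
    out, inputs = forward(weights, x)
    g = [o - t for o, t in zip(out, target)]
    peak = 0
    for k in reversed(range(len(weights))):
        # Both gradients are computed from the pre-update weights, so the
        # result matches the baseline step.
        grad_W, g = backward_layer(weights[k], inputs[k], g)
        peak = max(peak, len(grad_W) * len(grad_W[0]))
        sgd_update(weights[k], grad_W, lr)
        del grad_W  # only one layer's gradient is ever held
    return peak


shapes = [(3, 4), (5, 3), (2, 5)]  # layer sizes chosen only for illustration
baseline = make_weights(shapes)
fused = copy.deepcopy(baseline)
x, target = [1.0, -2.0, 0.5, 3.0], [0.0, 1.0]
print(standard_sgd_step(baseline, x, target, lr=0.01))  # 37 = 12 + 15 + 10
print(fused_sgd_step(fused, x, target, lr=0.01))        # 15 = largest tensor
```

In this toy setting both steps produce the same updated weights, but the fused step never holds more gradient elements than the largest layer has. The fusion relies on the optimizer needing only the current gradient, as plain SGD does; the sketch omits the gradient normalization, loss scaling, and precision handling mentioned in the summary.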

Full Parameter Fine-tuning for Large Language Models with Limited Resources

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Parameter-efficient Tuning for Large Language Model Without Calculating Its Gradients

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning

Practical offloading for fine-tuning LLM on commodity GPU via learned subspace projectors

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

A Study of Optimizations for Fine-tuning Large Language Models

Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity

mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs

Learning Global Controller in Latent Space for Parameter-Efficient Fine-Tuning

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning

Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning