AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Yifan Yang, Kai Zhen, Ershad Banijamali, Athanasios Mouchtaris, Zheng Zhang

2024-06-26

Abstract: Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on RoBERTa-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the excessive memory consumption of fine-tuning large language models (LLMs). Specifically:

1. **Memory Efficiency Issue**:
   - Fine-tuning LLMs requires an increasing amount of GPU memory as model sizes continue to grow.
   - The recently proposed Memory-efficient Zeroth-order (MeZO) method fine-tunes LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, this method suffers from significant performance degradation and is prone to divergence.
2. **Performance and Convergence Issues**:
   - Zeroth-order (ZO) methods face two major challenges when fine-tuning LLMs: a significant performance gap compared to first-order (FO) methods, and frequent divergence in large-scale tasks.
   - Existing improvements such as ZO-AdaMU and Sparse-MeZO attempt to enhance the performance of ZO methods but still suffer from high memory overhead and unstable performance.

To address these issues, the paper proposes the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, which aims to improve the performance and convergence of ZO methods. Specific measures include:

- Introducing fast-forward tensorized adapters that reduce the number of trainable parameters, thereby improving the dimension-dependent ZO estimation accuracy.
- Proposing an adaptive query number schedule to ensure convergence in large-scale ZO fine-tuning tasks.

Through theoretical analysis and experimental validation, the paper reports that AdaZeta performs well in terms of accuracy, memory efficiency, and convergence speed.
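To make the idea of forward-only fine-tuning with a growing query budget concrete, here is a minimal Python sketch. It is not the authors' implementation. It uses a generic two-point randomized gradient estimator averaged over several random directions, and a hypothetical linear-then-capped query schedule. The names `zo_gradient`, `query_schedule`, `zo_sgd`, and the toy `quadratic` loss, as well as all hyperparameter values, are assumptions introduced for illustration only. The tensorized adapter is not modelled here; a small parameter list stands in for it.

```python
import random


def zo_gradient(loss, theta, num_queries, eps=1e-3, rng=None):
    """Average `num_queries` two-point zeroth-order gradient estimates.

    Only forward evaluations of `loss` are used; no backpropagation.
    """
    rng = rng or random.Random(0)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_queries):
        # Sample a random Gaussian perturbation direction z.
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = loss([t + eps * zi for t, zi in zip(theta, z)])
        minus = loss([t - eps * zi for t, zi in zip(theta, z)])
        # Finite-difference estimate of the directional derivative along z.
        scale = (plus - minus) / (2.0 * eps)
        for i in range(dim):
            grad[i] += scale * z[i] / num_queries
    return grad


def query_schedule(epoch, growth=1, max_queries=8):
    """Hypothetical schedule (an assumption, not taken from the paper):
    the number of queries grows linearly with the epoch, then is capped."""
    return min(max_queries, 1 + growth * epoch)


def zo_sgd(loss, theta, epochs=10, steps_per_epoch=50, lr=0.05, seed=0):
    """Plain SGD driven by zeroth-order gradient estimates."""
    rng = random.Random(seed)
    theta = list(theta)
    for epoch in range(epochs):
        q = query_schedule(epoch)  # more queries per step in later epochs
        for _ in range(steps_per_epoch):
            g = zo_gradient(loss, theta, q, rng=rng)
            theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


def quadratic(theta):
    """Toy stand-in for a fine-tuning loss, minimized at theta = (1, ..., 1)."""
    return sum((t - 1.0) ** 2 for t in theta)


if __name__ == "__main__":
    final = zo_sgd(quadratic, [0.0] * 4)
    print("final loss:", quadratic(final))
```

The sketch reflects two design points from the summary: the variance of a ZO estimate grows with the number of trainable parameters, which is why a low-parameter adapter helps, and averaging over more queries per step reduces that variance at the cost of extra forward passes.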

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

Zeroth-Order Fine-Tuning of LLMs in Random Subspaces

Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models

Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models

ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization

Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

On the Convergence of Zeroth-Order Federated Tuning for Large Language Models

Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity

Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization

Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models

Full Parameter Fine-tuning for Large Language Models with Limited Resources

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

DPZero: Private Fine-Tuning of Language Models without Backpropagation

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Ziya2: Data-centric Learning is All LLMs Need

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation