AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Yifan Yang, Kai Zhen, Ershad Banijamali, Athanasios Mouchtaris, Zheng Zhang
2024-06-26
Abstract: Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
Computation and Language, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to address?

This paper addresses the excessive memory consumption of fine-tuning large language models (LLMs). It identifies two problems:

1. **Memory efficiency**:
   - Fine-tuning LLMs requires more and more GPU memory as model sizes grow.
   - The recently proposed Memory-efficient Zeroth-order (MeZO) method fine-tunes LLMs using only forward passes, which avoids the need for a backpropagation graph. However, it suffers from significant performance degradation and is prone to divergence.
2. **Performance and convergence**:
   - Zeroth-order (ZO) methods face two major challenges when fine-tuning LLMs: a significant performance gap relative to first-order (FO) methods, and frequent divergence on large-scale tasks.
   - Existing improvements such as ZO-AdaMU and Sparse-MeZO attempt to enhance the performance of ZO methods but still suffer from high memory overhead and unstable performance.

To address these issues, the paper proposes the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, which aims to improve the performance and convergence of ZO methods. It has two components:

- A fast-forward, low-parameter tensorized adapter, which reduces the number of trainable parameters and thereby improves the dimension-dependent ZO estimation accuracy.
- An adaptive query number schedule, which the paper states guarantees convergence in large-scale ZO fine-tuning tasks.

Theoretical analysis and experiments on Roberta-Large and Llama-2-7B support the framework's accuracy, memory efficiency, and convergence speed.
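The two ideas summarized above can be illustrated with a small sketch. It is not the paper's implementation: `zo_gradient` is a generic two-point (SPSA-style) estimator that uses only loss evaluations, `query_schedule` is a hypothetical schedule that grows the number of queries with the epoch (the paper's actual schedule is not given here), and `tt_param_count` is the standard parameter count of a tensor-train factorization with a uniform internal rank, not the layout of the AdaZeta adapter. The toy quadratic loss and all hyperparameters are assumptions chosen for illustration.

```python
import random


def zo_gradient(loss, theta, num_queries, eps=1e-3, rng=None):
    """Average num_queries two-point gradient estimates.

    Only loss evaluations (forward passes) are used; no backpropagation.
    """
    rng = rng or random.Random(0)
    d = len(theta)
    grad = [0.0] * d
    for _ in range(num_queries):
        # Random Gaussian perturbation direction.
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = [t + eps * zi for t, zi in zip(theta, z)]
        minus = [t - eps * zi for t, zi in zip(theta, z)]
        # Finite-difference estimate of the directional derivative along z.
        scale = (loss(plus) - loss(minus)) / (2.0 * eps)
        for i in range(d):
            grad[i] += scale * z[i] / num_queries
    return grad


def query_schedule(epoch, max_queries=8):
    """Hypothetical schedule (an assumption): one more query per epoch, capped."""
    return min(epoch + 1, max_queries)


def tt_param_count(dims, rank):
    """Parameters in a tensor-train format with boundary ranks 1 and uniform internal rank."""
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))


def quadratic_loss(theta):
    """Toy stand-in for a model's loss; minimized at the origin."""
    return sum(t * t for t in theta)


def train(epochs=30, steps_per_epoch=20, lr=0.05):
    """ZO-SGD on the toy loss, averaging more queries in later epochs."""
    rng = random.Random(0)
    theta = [1.0, -2.0, 0.5, 3.0]
    for epoch in range(epochs):
        q = query_schedule(epoch)
        for _ in range(steps_per_epoch):
            g = zo_gradient(quadratic_loss, theta, q, rng=rng)
            theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


if __name__ == "__main__":
    print("final loss:", quadratic_loss(train()))
    print("TT params for a 16x16x16 tensor at rank 4:", tt_param_count([16, 16, 16], 4))
```

Averaging more queries lowers the variance of the estimate at the cost of extra forward passes, which is the trade-off a query schedule controls. Factorizing the trainable weights shrinks the dimension that the estimator's variance scales with.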