MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter

Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren
2024-06-07
Abstract: Parameter-Efficient Fine-tuning (PEFT) facilitates the fine-tuning of Large Language Models (LLMs) under limited resources. However, the fine-tuning performance with PEFT on complex, knowledge-intensive tasks is limited due to the constrained model capacity, which originates from the limited number of additional trainable parameters. To overcome this limitation, we introduce a novel mechanism that fine-tunes LLMs with larger adapters while remaining memory-efficient. This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs and utilizing the larger capacity of Central Processing Unit (CPU) memory compared to that of the Graphics Processing Unit (GPU). We store and update the parameters of the larger adapters on the CPU. Moreover, we employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU. This is particularly beneficial given the limited bandwidth of PCI Express (PCIe). Our method can achieve fine-tuning results comparable to those obtained with larger memory capacities, even when operating under more limited resources such as a single GPU with 24GB of memory, with an acceptable loss in training efficiency. Our code is available at https://github.com/CURRENTF/MEFT.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to fine-tune Large Language Models (LLMs) effectively under limited resources. Existing Parameter-Efficient Fine-Tuning (PEFT) methods avoid updating all of a large model's parameters by adding a small number of trainable modules. However, these methods perform poorly on complex, knowledge-intensive tasks, because model capacity is constrained by the small number of additional trainable parameters.
To overcome this limitation, the paper proposes Memory-Efficient Fine-Tuning through Sparse Adapter (MEFT), which exploits the inherent activation sparsity of the Feed-Forward Networks (FFNs) in LLMs and the fact that CPU memory is typically larger than GPU memory. MEFT stores and updates the parameters of a larger adapter on the CPU, and adopts an architecture similar to Mixture-of-Experts (MoE) to reduce CPU computation and GPU-CPU communication, which matters because PCI Express (PCIe) bandwidth is limited.
Experiments show that, on a single GPU with only 24GB of memory, MEFT achieves fine-tuning results comparable to those obtained in larger-memory settings, at an acceptable cost in training efficiency. Under the same memory budget, MEFT outperforms other PEFT methods such as LoRA and Parallel Adapter.
In summary, the main contributions of the paper are:
1. Proposing a fine-tuning method that combines sparse activation with an MoE-like architecture to improve memory efficiency.
2. Reducing communication overhead by limiting the number of activated neurons and transferring only those neurons to the GPU.
3. Introducing a Key-Experts mechanism that partitions the large number of adapter parameters and reduces the CPU computational burden.
4. Demonstrating experimentally that MEFT achieves the best results on knowledge-intensive tasks in a resource-constrained environment and performs comparably to the baselines in a resource-rich environment.
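The sparse-activation and expert-routing ideas summarized above can be illustrated with a small, hedged sketch. This is not the authors' implementation: the function names (`route_experts`, `sparse_adapter_forward`), the ReLU activation, the dot-product routing rule, and the toy weights are all assumptions made for illustration, and plain Python lists stand in for CPU and GPU tensors. The sketch shows only the general pattern: a router first narrows the search to a few experts (groups of adapter neurons), then only the `k` most active neurons in those experts contribute to the output, so only their rows would need to be moved between devices.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def route_experts(h, expert_keys, top_e):
    """Return the indices of the top_e experts whose key best matches h.

    Assumption: routing is a dot product between h and one key per expert.
    """
    scores = [dot(key, h) for key in expert_keys]
    order = sorted(range(len(expert_keys)), key=lambda e: scores[e], reverse=True)
    return order[:top_e]


def sparse_adapter_forward(h, w_up, w_down, expert_keys, expert_size, top_e, k):
    """Hypothetical sparse adapter forward pass for one hidden vector h.

    w_up[i]   : input weights of adapter neuron i
    w_down[i] : output weights of adapter neuron i
    Neurons are partitioned into contiguous experts of expert_size neurons.
    """
    # 1) Routing: only neurons inside the chosen experts are considered,
    #    which avoids scoring every neuron on the CPU.
    experts = route_experts(h, expert_keys, top_e)
    candidates = [e * expert_size + j for e in experts for j in range(expert_size)]

    # 2) ReLU activations, computed for candidate neurons only.
    acts = {i: max(0.0, dot(w_up[i], h)) for i in candidates}

    # 3) Keep the k most active neurons; only these rows would be transferred.
    selected = sorted(candidates, key=lambda i: acts[i], reverse=True)[:k]

    # 4) Adapter output is the activation-weighted sum of the selected rows.
    out = [0.0] * len(w_down[0])
    for i in selected:
        for d in range(len(out)):
            out[d] += acts[i] * w_down[i][d]
    return out, selected


# Toy example: 4 adapter neurons, split into 2 experts of 2 neurons each.
w_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
expert_keys = [[1.0, 1.0], [-1.0, -1.0]]
h = [3.0, 1.0]

out, selected = sparse_adapter_forward(
    h, w_up, w_down, expert_keys, expert_size=2, top_e=1, k=1
)
print(out, selected)  # [3.0, 0.0] [0]
```

In this toy setup, only one of the four neuron rows is used for the given input, which is the kind of saving that makes storing a large adapter in CPU memory and shipping only the active part over PCIe plausible. The real method's routing rule, activation function, and transfer scheme may differ from these assumptions.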