Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning

Baohao Liao, Shaomu Tan, Christof Monz
2023-10-20
Abstract: Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach, with training only a small number of parameters without sacrificing performance and becoming the de-facto learning paradigm with the increasing size of PLMs. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, akin to fine-tuning. One effective way to reduce the activation memory is to apply a reversible model, so the intermediate activations are not necessary to be cached and can be recomputed. Nevertheless, modifying a PLM to its reversible variant is not straightforward, since the reversible model has a distinct architecture from the currently released PLMs. In this paper, we first investigate what is a key factor for the success of existing PEFT methods, and realize that it's essential to preserve the PLM's starting point when initializing a PEFT method. With this finding, we propose memory-efficient fine-tuning (MEFT) that inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones, BERT, RoBERTa, BART and OPT. MEFT significantly reduces the activation memory up to 84% of full fine-tuning with a negligible amount of trainable parameters. Moreover, MEFT achieves the same score on GLUE and a comparable score on the question-answering tasks as full fine-tuning. A similar finding is also observed for the image classification task.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the high memory cost of parameter-efficient fine-tuning (PEFT). Existing PEFT methods update only a small number of parameters, which reduces storage requirements while matching the performance of full fine-tuning, but they are not memory-efficient: they still cache most intermediate activations for the gradient calculation during backpropagation, much as full fine-tuning does. To tackle this, the authors propose memory-efficient fine-tuning (MEFT), which aims to cut activation memory without sacrificing performance. The core idea is to turn the pre-trained language model (PLM) into a reversible model, so that intermediate activations need not be cached during the forward pass and can instead be recomputed during backpropagation. Because a reversible model has a different architecture from currently released PLMs, this conversion is not straightforward. The authors first investigate what makes existing PEFT methods succeed and find that newly added parameters must be initialized so that the PLM's starting point is preserved. Building on this finding, they design three MEFT methods that insert adapters into the PLM, making it reversible without additional pre-training while leaving its starting point unchanged. Evaluated on the GLUE benchmark and five question-answering tasks with BERT, RoBERTa, BART and OPT backbones, MEFT reduces activation memory by up to 84% relative to full fine-tuning with a negligible number of trainable parameters, matches full fine-tuning on GLUE, and scores comparably on the question-answering tasks.
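To make the two ideas in this summary concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it uses a generic additive coupling of the kind found in reversible networks, and the functions `f`, `g`, `adapter` and `layer_with_adapter` are hypothetical stand-ins chosen for illustration. The first part shows that a reversible layer's inputs can be recovered exactly from its outputs, so they would not need to be cached. The second part shows one common way to preserve a model's starting point, assuming the adapter's weight is initialized to zero.

```python
# Illustrative sketch only: a generic additive coupling, not MEFT itself.
# f, g, adapter and layer_with_adapter are hypothetical stand-ins.

def f(v):
    # Stand-in for a frozen sub-layer; any deterministic function works.
    return [0.5 * x + 1.0 for x in v]

def g(v):
    # Stand-in for a second sub-layer; it does not need to be invertible.
    return [x * x for x in v]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def forward(x1, x2):
    # The input is split into two streams that update each other in turn.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order to recover the inputs.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

def adapter(v, weight):
    # Hypothetical one-parameter adapter; a real adapter is a small network.
    return [weight * x for x in v]

def layer_with_adapter(v, weight):
    # Frozen sub-layer plus a trainable adapter added to its output.
    return add(f(v), adapter(v, weight))

if __name__ == "__main__":
    x1, x2 = [1.0, 2.0], [3.0, 4.0]
    y1, y2 = forward(x1, x2)
    print(inverse(y1, y2))  # inputs recovered from outputs alone

    # With a zero-initialized adapter the layer matches the frozen sub-layer.
    print(layer_with_adapter(x1, 0.0) == f(x1))
```

Because `inverse` needs only the layer's outputs, a backward pass can rebuild each layer's inputs on the fly instead of storing them, which is where the activation-memory saving comes from. The zero-weight check illustrates why initialization matters: at the start of fine-tuning the modified layer computes exactly what the frozen one did.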