Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning

Baohao Liao, Shaomu Tan, Christof Monz

2023-10-20

Abstract: Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach, with training only a small number of parameters without sacrificing performance and becoming the de-facto learning paradigm with the increasing size of PLMs. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, akin to fine-tuning. One effective way to reduce the activation memory is to apply a reversible model, so the intermediate activations are not necessary to be cached and can be recomputed. Nevertheless, modifying a PLM to its reversible variant is not straightforward, since the reversible model has a distinct architecture from the currently released PLMs. In this paper, we first investigate what is a key factor for the success of existing PEFT methods, and realize that it's essential to preserve the PLM's starting point when initializing a PEFT method. With this finding, we propose memory-efficient fine-tuning (MEFT) that inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones, BERT, RoBERTa, BART and OPT. MEFT significantly reduces the activation memory up to 84% of full fine-tuning with a negligible amount of trainable parameters. Moreover, MEFT achieves the same score on GLUE and a comparable score on the question-answering tasks as full fine-tuning. A similar finding is also observed for the image classification task.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the memory inefficiency of parameter-efficient fine-tuning (PEFT). Existing PEFT methods update only a small number of parameters, which reduces storage requirements while matching the performance of full fine-tuning. They do not reduce training memory to the same degree, because backpropagation still needs most of the intermediate activations, and these must be cached during the forward pass.

To tackle this, the authors propose memory-efficient fine-tuning (MEFT), which aims to cut memory usage without sacrificing performance. The core idea is to turn the pre-trained language model (PLM) into a reversible model: intermediate activations are not cached during the forward pass and are instead recomputed from the layer outputs during backpropagation. This lowers activation memory while keeping the method parameter-efficient.

The authors first investigate why existing PEFT methods succeed and find that the newly added parameters must be initialized so that the modified model starts from the PLM's original state; this turns out to be crucial for performance. Based on this finding, they design three MEFT methods that insert adapters into the PLM, making it reversible while leaving its starting point unchanged. Experiments show that MEFT performs comparably to full fine-tuning on multiple benchmark tasks while significantly reducing memory usage.
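The reversibility idea described above can be illustrated with a generic RevNet-style coupling, in which a layer's inputs are recovered exactly from its outputs. This is only a minimal sketch of the general technique, not the paper's MEFT architecture: the sub-functions `f` and `g` are hypothetical stand-ins for whatever blocks a real model would use, and plain Python lists replace tensors.

```python
import math

def f(v):
    # Hypothetical sub-function F (a stand-in for, e.g., an attention or adapter block).
    return [math.tanh(0.5 * a + 0.1) for a in v]

def g(v):
    # Hypothetical sub-function G (a stand-in for, e.g., a feed-forward block).
    return [0.3 * math.sin(a) for a in v]

def forward(x1, x2):
    # Reversible coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Recover the inputs from the outputs alone, so x1 and x2
    # need not be kept in memory after the forward pass.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.2, -1.0, 0.7], [1.5, 0.3, -0.4]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# Reconstruction error is limited to floating-point rounding.
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print("inputs recovered:", max_err < 1e-9)
```

Because each layer's input can be rebuilt from its output in this way, a backward pass only needs the final outputs plus one layer's worth of recomputed activations at a time, at the cost of extra computation for the recomputation.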

LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning

From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers

Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter

SPAFIT: Stratified Progressive Adaptation Fine-tuning for Pre-trained Large Language Models

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

PEMA: An Offsite-Tunable Plug-in External Memory Adaptation for Language Models

READ: Recurrent Adaptation of Large Transformers

Efficient Fine-Tuning of BERT Models on the Edge

APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Advancing Parameter Efficiency in Fine-tuning via Representation Editing

Output Layer Go First: Better Fine-tuning by Bridging the Gap with Pre-training

See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition

One Network, Many Masks: Towards More Parameter-Efficient Transfer Learning