Abstract: Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach: it trains only a small number of parameters without sacrificing performance, and it has become the de facto learning paradigm as PLMs grow in size. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, much like full fine-tuning. One effective way to reduce the activation memory is to use a reversible model, so that the intermediate activations need not be cached and can instead be recomputed. Nevertheless, modifying a PLM into its reversible variant is not straightforward, since a reversible model has a distinct architecture from the currently released PLMs. In this paper, we first investigate what is a key factor in the success of existing PEFT methods, and find that it is essential to preserve the PLM's starting point when initializing a PEFT method. With this finding, we propose memory-efficient fine-tuning (MEFT), which inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones: BERT, RoBERTa, BART and OPT. MEFT significantly reduces the activation memory, by up to 84% relative to full fine-tuning, with a negligible number of trainable parameters. Moreover, MEFT achieves the same score on GLUE and a comparable score on the question-answering tasks as full fine-tuning. A similar finding is also observed for the image classification task.
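
To illustrate why a reversible model does not need cached activations, here is a minimal sketch of a generic additive coupling block in the style of reversible residual networks. It is not the MEFT architecture itself; the sub-functions `f` and `g`, the helper names, and the toy inputs are all hypothetical stand-ins chosen for illustration. The point is only that the inputs of the block can be recomputed exactly from its outputs.

```python
import math

def f(v):
    # Hypothetical sub-network F (assumption): any deterministic function works.
    return [math.tanh(2.0 * a) for a in v]

def g(v):
    # Hypothetical sub-network G (assumption): it does not need to be invertible.
    return [0.5 * a * a for a in v]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def forward(x1, x2):
    # Additive coupling: each half is updated using only the other half.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order to recover the block inputs.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

# Toy inputs (assumption): two halves of a hidden state.
x1, x2 = [0.1, -0.4, 0.7], [1.2, 0.0, -0.3]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# Largest reconstruction error; it should be at floating-point rounding level.
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print("max reconstruction error:", max_err)
```

Because `inverse` rebuilds `x1` and `x2` from `y1` and `y2`, a backward pass over a stack of such blocks can recompute each block's inputs on the fly instead of storing them, trading extra computation for lower activation memory.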

Output Layer Go First: Better Fine-tuning by Bridging the Gap with Pre-training

HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation Perturbation

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Layer-wise Learning Rate Optimization for Task-Dependent Fine-Tuning of Pre-trained Models: An Evolutionary Approach

Noise-Robust Fine-Tuning of Pretrained Language Models via External Guidance

Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning

Empirical Analysis of Efficient Fine-Tuning Methods for Large Pre-Trained Language Models

Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts

Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively

MPNet: Masked and Permuted Pre-training for Language Understanding

Towards Making the Most of BERT in Neural Machine Translation

Parameter-efficient fine-tuning of large-scale pre-trained language models

Ahead-of-Time P-Tuning

SPAFIT: Stratified Progressive Adaptation Fine-tuning for Pre-trained Large Language Models

Efficient Fine-Tuning of Compressed Language Models with Learners

Analyzing and Reducing the Performance Gap in Cross-Lingual Transfer with Fine-tuning Slow and Fast

Fine-tuning large neural language models for biomedical natural language processing

Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models

Recent Advances in Pre-trained Language Models: Why Do They Work and How Do They Work