Abstract: Fine-tuning pre-trained models has proven effective across a wide range of NLP tasks. However, fine-tuning the whole model is parameter-inefficient, as it always yields an entirely new model for each task. Many recent works therefore propose fine-tuning only a small portion of the parameters while keeping most of the parameters shared across tasks. These methods achieve surprisingly good performance and are shown to be more stable than their fully fine-tuned counterparts. However, such methods are still not well understood. Some natural questions arise: How does parameter sparsity lead to promising performance? Why is the model more stable than fully fine-tuned models? How should the tunable parameters be chosen? In this paper, we first categorize the existing methods into random approaches, rule-based approaches, and projection-based approaches, according to how they choose which parameters to tune. Then, we show that all of these methods are in fact sparse fine-tuned models and conduct a novel theoretical analysis of them. We show that the sparsity imposes a regularization on the original model by controlling the upper bound of the stability. Such stability leads to better generalization capability, which has been empirically observed in many recent works. Although our theory grounds the effectiveness of sparsity, how to choose the tunable parameters remains an open problem. To better choose them, we propose a novel Second-order Approximation Method (SAM), which approximates the original problem with an analytically solvable optimization function. The tunable parameters are determined by directly optimizing the approximation function. The experimental results show that our proposed SAM model outperforms many strong baseline models and also verify our theoretical analysis.
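To make the idea of choosing tunable parameters from an analytically solvable approximation concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a second-order Taylor model of the loss with a positive diagonal Hessian estimate, under which each coordinate's best achievable loss reduction is g_i^2 / (2 h_i). The function names `select_tunable` and `masked_step`, the diagonal-Hessian assumption, and the toy numbers are all illustrative and are not taken from the abstract.

```python
def select_tunable(grad, hess_diag, k):
    """Return a boolean mask marking the k coordinates to tune.

    Assumption (illustrative): the loss is modelled as
    L(theta + d) ~ L(theta) + sum_i g_i d_i + 0.5 * sum_i h_i d_i^2, with h_i > 0.
    Minimizing coordinate i alone gives d_i = -g_i / h_i and a loss
    reduction of g_i^2 / (2 h_i), so we keep the k largest reductions.
    """
    gains = [g * g / (2.0 * h) for g, h in zip(grad, hess_diag)]
    top = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)[:k]
    chosen = set(top)
    return [i in chosen for i in range(len(gains))]


def masked_step(theta, grad, hess_diag, mask):
    """Apply the closed-form update only to masked coordinates; others stay shared."""
    return [t - g / h if m else t
            for t, g, h, m in zip(theta, grad, hess_diag, mask)]


# Toy example with four parameters and a budget of two tunable ones.
grad = [1.0, -4.0, 2.0, 0.5]
hess_diag = [1.0, 2.0, 1.0, 1.0]
mask = select_tunable(grad, hess_diag, k=2)
new_theta = masked_step([0.0, 0.0, 0.0, 0.0], grad, hess_diag, mask)
```

In this toy setting only the second and third parameters are updated; the remaining ones keep their pre-trained values, which is the sparsity pattern the abstract refers to.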
