Abstract: Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we investigate two phenomena observed during the fine-tuning of LLMs, focusing on the attention mechanism: (1) Different Impact: optimizing the $\mathbf{W}_v$ matrix improves performance significantly more than optimizing the $\mathbf{W}_k$ matrix. Fine-tuning only the $\mathbf{W}_q$ and $\mathbf{W}_v$ matrices is computationally efficient and delivers results that are comparable to, or even better than, fine-tuning all three matrices $\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$. (2) Efficient Convergence: employing distinct learning rates for these matrices is crucial for optimal performance, with a higher learning rate for the $\mathbf{W}_v$ matrix expediting convergence. However, theoretical analyses of these phenomena remain limited. We present a theoretical analysis from two perspectives: (i) Generalization, where we demonstrate that fine-tuning only $\mathbf{W}_q$ and $\mathbf{W}_v$ improves generalization bounds and enhances memory efficiency; and (ii) Optimization, where we emphasize that the feature learning of the attention mechanism is efficient, particularly when using distinct learning rates for the matrices, which leads to more effective fine-tuning. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving lightweight algorithms in LLM fine-tuning.
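The two phenomena described in the abstract can be illustrated with a small bookkeeping sketch: train $\mathbf{W}_q$ and $\mathbf{W}_v$, freeze $\mathbf{W}_k$, and give $\mathbf{W}_v$ a larger learning rate. This is not the strategy proposed in the paper, only a hedged illustration of the setup. The names `ParamGroup`, `plan_attention_finetuning`, and `trainable_fraction`, the square bias-free projection shape, and the values `d_model=64`, `base_lr=1e-4`, and `value_lr_ratio=8.0` are all assumptions made for this example and do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str        # which attention projection this group covers
    num_params: int  # number of scalar weights in the projection
    lr: float        # learning rate; 0.0 marks the projection as frozen


def plan_attention_finetuning(d_model, base_lr, value_lr_ratio):
    """Build one group per attention projection of a single layer.

    W_q and W_v are trained and W_k is frozen. W_v uses a learning rate
    that is value_lr_ratio times the one used for W_q (hypothetical ratio).
    """
    # Assumption: each projection is a square, bias-free d_model x d_model matrix.
    size = d_model * d_model
    return [
        ParamGroup("W_q", size, base_lr),
        ParamGroup("W_k", size, 0.0),
        ParamGroup("W_v", size, base_lr * value_lr_ratio),
    ]


def trainable_fraction(groups):
    """Share of attention-projection weights that receive updates."""
    total = sum(g.num_params for g in groups)
    trainable = sum(g.num_params for g in groups if g.lr > 0.0)
    return trainable / total


# Illustrative values only; the source does not specify any of them.
groups = plan_attention_finetuning(d_model=64, base_lr=1e-4, value_lr_ratio=8.0)
for g in groups:
    status = "frozen" if g.lr == 0.0 else f"lr={g.lr:.1e}"
    print(f"{g.name}: {g.num_params} params, {status}")
print(f"trainable fraction: {trainable_fraction(groups):.3f}")
```

Under these assumptions, two of the three projections are updated, so one third of the projection weights need no gradients or optimizer state. In a real training setup the same plan would typically be expressed as per-parameter optimizer groups, with the frozen matrix excluded from the optimizer.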

Learning Global Controller in Latent Space for Parameter-Efficient Fine-Tuning

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Parameter-efficient Tuning for Large Language Model Without Calculating Its Gradients

Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs

Parameter-efficient fine-tuning of large-scale pre-trained language models

Towards a Unified View of Parameter-Efficient Transfer Learning

Full Parameter Fine-tuning for Large Language Models with Limited Resources

Position-Aware Parameter Efficient Fine-Tuning Approach for Reducing Positional Bias in LLMs

Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation

Non-Intrusive Adaptation: Input-Centric Parameter-efficient Fine-Tuning for Versatile Multimodal Modeling

LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning

Aligner: One Global Token is Worth Millions of Parameters when Aligning Large Language Models

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning

Let's Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model

Towards Better Parameter-Efficient Fine-Tuning for Large Language Models: A Position Paper

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models

HFT: Half Fine-Tuning for Large Language Models

Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning