Abstract: Recent advancements in large-scale pretrained models have significantly improved performance across a variety of tasks in natural language processing and computer vision. However, the extensive number of parameters in these models necessitates substantial memory and computational resources for full training. To adapt these models for downstream tasks or specific application-oriented datasets, parameter-efficient fine-tuning methods leveraging pretrained parameters have gained considerable attention. Even so, fine-tuning can still be time-consuming because of the large number of parameters and epochs. In this work, we introduce AdaFish, an efficient second-order algorithm designed to expedite the training process within low-rank decomposition-based fine-tuning frameworks. Our key observation is that the associated generalized Fisher information matrix is either low-rank or extremely small-scaled. Such a generalized Fisher information matrix is shown to be equivalent to the Hessian matrix. Moreover, we prove the global convergence of AdaFish, along with its iteration/oracle complexity. Numerical experiments show that our algorithm is quite competitive with the state-of-the-art AdamW method.
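
For context, the classical form of the Fisher/Hessian equivalence can be written out as follows. This is background only, not the paper's generalized statement: the symbols $p(y \mid x;\theta)$ (the model's predictive distribution) and $\ell$ (the negative log-likelihood) are assumptions introduced here, and the identity requires the usual regularity conditions and labels drawn from the model's own predictive distribution.

$$
F(\theta) = \mathbb{E}_{y \sim p(\cdot \mid x;\theta)}\left[\nabla_\theta \ell(\theta)\,\nabla_\theta \ell(\theta)^{\top}\right] = \mathbb{E}_{y \sim p(\cdot \mid x;\theta)}\left[\nabla_\theta^{2}\, \ell(\theta)\right], \qquad \ell(\theta) = -\log p(y \mid x;\theta).
$$

The conditions under which the paper establishes the equivalence for its generalized Fisher information matrix may differ from these.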

What problem does this paper attempt to address?

This paper addresses the cost of parameter-efficient fine-tuning of large-scale pre-trained models on task-specific or application-oriented datasets. Although these pre-trained models have achieved significant performance improvements in tasks such as natural language processing and computer vision, their large number of parameters requires substantial memory and computing resources for full training. To adapt the models to downstream tasks or specific application scenarios, researchers have turned to parameter-efficient fine-tuning methods that reuse the pre-trained parameters. Even though these methods reduce the number of parameters that are adjusted, they still take a long time because of the large number of parameters and training epochs.

To this end, the paper proposes AdaFish, an efficient second-order optimization algorithm designed for low-rank decomposition-based fine-tuning frameworks. Its core observation is that the generalized Fisher information matrix associated with the model is either low-rank or very small in scale. The paper shows that, under certain conditions, this generalized Fisher information matrix is equivalent to the Hessian matrix, and it proves the global convergence of AdaFish along with its iteration/oracle complexity. Numerical experiments show that AdaFish is quite competitive with the state-of-the-art AdamW method.

### Main contributions of the paper:

1. **Utilizing low-rank characteristics**: The paper exploits the inherent low-rank characteristics of the weight matrix during fine-tuning and introduces a new method that approximates Hessian information using a portable Fisher information matrix. Under certain conditions, it establishes the equivalence between this Fisher information matrix and the Hessian matrix, which underlines its usefulness for capturing second-order information efficiently.
2. **Proposing the AdaFish algorithm**: Combining an exponential moving average with the fact that this matrix is cheap to store and compute, the paper constructs an adaptive Fisher information matrix as an effective substitute for the second moment in AdamW. Unlike traditional methods, this Fisher information matrix is neither purely diagonal nor element-wise.
3. **Theoretical analysis and empirical evaluation**: The paper establishes the convergence and iteration/oracle complexity of AdaFish. Empirical evaluation on image classification and language processing tasks shows that AdaFish not only outperforms AdamW but can also halve the number of training epochs required, highlighting its efficiency and effectiveness in model fine-tuning.

### Conclusion:

AdaFish accelerates parameter-efficient fine-tuning by exploiting the low-rank structure and the adaptive Fisher information matrix. The experimental results show that AdaFish performs well on multiple tasks and has broad application potential. Future work can explore extending AdaFish to more low-rank-based fine-tuning frameworks.
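
The idea in contribution 2 can be illustrated with a minimal NumPy sketch. It is not the paper's exact update rule: it assumes a single low-rank factor `A` of shape r x n, replaces AdamW's elementwise second moment with an r x r exponential moving average of gradient outer products, and preconditions the momentum with the inverse square root of that matrix. The function names (`inv_sqrt_psd`, `fisher_step`, `run_toy_problem`), the hyperparameter values, and the least-squares toy problem are all hypothetical and chosen only for illustration.

```python
import numpy as np


def inv_sqrt_psd(M, eps):
    """Return (M + eps * I)^(-1/2) for a symmetric PSD matrix M."""
    w, V = np.linalg.eigh(M)             # M = V diag(w) V^T
    w = np.maximum(w, 0.0) + eps         # clip round-off negatives, then damp
    return (V / np.sqrt(w)) @ V.T        # V diag(w^(-1/2)) V^T


def fisher_step(A, grad, m, F, t, lr=0.05, beta1=0.9, beta2=0.999,
                eps=1e-8, weight_decay=0.0):
    """One hypothetical update of an r x n low-rank factor A (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad               # first moment, as in AdamW
    F = beta2 * F + (1.0 - beta2) * (grad @ grad.T)    # r x r EMA of gradient outer products
    m_hat = m / (1.0 - beta1 ** t)                     # bias corrections
    F_hat = F / (1.0 - beta2 ** t)
    A = A - lr * weight_decay * A                      # decoupled weight decay
    A = A - lr * inv_sqrt_psd(F_hat, eps) @ m_hat      # non-diagonal preconditioning
    return A, m, F


def run_toy_problem(steps=500, seed=0, m_dim=32, n_dim=16, r=4):
    """Fit A in 0.5 * ||B A - D||_F^2 with B fixed; return (initial, final) loss."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((m_dim, r))
    D = B @ rng.standard_normal((r, n_dim))            # target reachable at rank r
    A = np.zeros((r, n_dim))
    mom, F = np.zeros_like(A), np.zeros((r, r))

    def loss(A):
        return 0.5 * np.sum((B @ A - D) ** 2)

    initial = loss(A)
    for t in range(1, steps + 1):
        grad = B.T @ (B @ A - D)                       # gradient of the loss w.r.t. A
        A, mom, F = fisher_step(A, grad, mom, F, t)
    return initial, loss(A)
```

Calling `run_toy_problem()` returns the loss before and after training. In this sketch the extra optimizer state is one r x r matrix per factor, compared with the r x n entries an elementwise second moment would need, which is one way to read the summary's remark that the matrix is easy to store and calculate.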

AdaFish: Fast low-rank parameter-efficient fine-tuning by using second-order information

AdaFisher: Adaptive Second Order Optimization via Fisher Information

An Efficient Fisher Matrix Approximation Method for Large-Scale Neural Network Optimization

Cuttlefish: Low-Rank Model Training without All the Tuning

Efficient Model Compression Techniques with FishLeg

HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy

PELA: Learning Parameter-Efficient Models with Low-Rank Approximation

Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning

Data-oriented Dynamic Fine-tuning Parameter Selection Strategy for FISH Mask based Efficient Fine-tuning

LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning

Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

Second-Order Fine-Tuning Without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer

Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications

AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large Models

See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition

Applying Second Order Optimization to Deep Transformers with Parameter-Efficient Tuning

Parameter-efficient Tuning for Large Language Model Without Calculating Its Gradients

HUT: A More Computation Efficient Fine-Tuning Method With Hadamard Updated Transformation

LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models