AdaFish: Fast low-rank parameter-efficient fine-tuning by using second-order information

Jiang Hu, Quanzheng Li
2024-03-20
Abstract: Recent advancements in large-scale pretrained models have significantly improved performance across a variety of tasks in natural language processing and computer vision. However, the extensive number of parameters in these models necessitates substantial memory and computational resources for full training. To adapt these models for downstream tasks or specific application-oriented datasets, parameter-efficient fine-tuning methods leveraging pretrained parameters have gained considerable attention. However, it can still be time-consuming due to lots of parameters and epochs. In this work, we introduce AdaFish, an efficient algorithm of the second-order type designed to expedite the training process within low-rank decomposition-based fine-tuning frameworks. Our key observation is that the associated generalized Fisher information matrix is either low-rank or extremely small-scaled. Such a generalized Fisher information matrix is shown to be equivalent to the Hessian matrix. Moreover, we prove the global convergence of AdaFish, along with its iteration/oracle complexity. Numerical experiments show that our algorithm is quite competitive with the state-of-the-art AdamW method.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the cost of parameter-efficient fine-tuning of large-scale pretrained models on task-specific or application-oriented datasets. Although these models have brought significant performance gains in natural language processing and computer vision, their large parameter counts demand substantial memory and computing resources for full training. To adapt them to downstream tasks or specific application scenarios, researchers have turned to parameter-efficient fine-tuning methods that reuse the pretrained parameters. Even though these methods reduce the number of trainable parameters, fine-tuning can still be slow because of the many parameters and training epochs that remain.

To this end, the paper proposes AdaFish, an efficient second-order optimization algorithm designed for low-rank decomposition-based fine-tuning frameworks. Its core observation is that the associated generalized Fisher information matrix is either low-rank or very small in scale. The paper shows that this generalized Fisher information matrix is equivalent to the Hessian matrix, and it proves the global convergence of AdaFish along with its iteration/oracle complexity. Numerical experiments indicate that AdaFish is competitive with the state-of-the-art AdamW method, with reported gains in both training speed and final performance.

### Main contributions of the paper:

1. **Utilizing low-rank characteristics**: The paper exploits the inherent low-rank structure of the weight matrices during fine-tuning and introduces a new way to approximate Hessian information using a portable Fisher information matrix. Under certain conditions, it establishes the equivalence between this Fisher information matrix and the Hessian matrix, which underlines its usefulness for capturing second-order information efficiently.
2. **Proposing the AdaFish algorithm**: Using an exponential moving average, and relying on the fact that the matrix is cheap to store and compute, the paper constructs an adaptive Fisher information matrix as an effective substitute for the second moment in AdamW. Unlike traditional methods, this Fisher information matrix is neither purely diagonal nor element-wise.
3. **Theoretical analysis and empirical evaluation**: The paper establishes the convergence and iteration/oracle complexity of AdaFish. Empirical evaluation on image classification and language processing tasks shows that AdaFish not only outperforms AdamW but can also halve the required number of training epochs, highlighting its efficiency and effectiveness in model fine-tuning.

### Conclusion:

AdaFish accelerates parameter-efficient fine-tuning by exploiting the low-rank structure and an adaptive Fisher information matrix. The experimental results show that AdaFish performs well on multiple tasks and has broad application potential. Future work can explore extending AdaFish to more low-rank-based fine-tuning frameworks.
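
The idea behind the second contribution can be illustrated with a minimal sketch. It is not the paper's exact update rule: the function names, the toy quadratic loss, the hyperparameters, and the inverse-square-root scaling (borrowed from Adam-style preconditioning) are all assumptions made for illustration. What it does show is the structural point from the summary: for a low-rank factor of shape `r x n`, an exponential moving average of `G @ G.T` is only `r x r`, so a non-diagonal preconditioner is cheap to store and invert.

```python
import numpy as np


def inv_sqrt_psd(mat, eps=1e-8):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.maximum(vals, 0.0)  # guard against tiny negative round-off
    # Scale each eigenvector column by 1 / (sqrt(eigenvalue) + eps).
    return (vecs / (np.sqrt(vals) + eps)) @ vecs.T


def fisher_preconditioned_step(A, grad_A, M, F, step,
                               lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update of a low-rank factor A with shape (r, n).

    M is an EMA of gradients (r x n); F is an EMA of G @ G.T (r x r).
    This is an assumed, simplified rule, not the published algorithm.
    """
    M = beta1 * M + (1.0 - beta1) * grad_A
    F = beta2 * F + (1.0 - beta2) * (grad_A @ grad_A.T)
    M_hat = M / (1.0 - beta1 ** step)  # Adam-style bias correction
    F_hat = F / (1.0 - beta2 ** step)
    A = A - lr * inv_sqrt_psd(F_hat, eps) @ M_hat
    return A, M, F


# Toy problem (hypothetical): minimize 0.5 * ||A - T||_F^2 over A.
rng = np.random.default_rng(0)
r, n = 4, 16
T = rng.standard_normal((r, n))
A = np.zeros((r, n))
M = np.zeros((r, n))
F = np.zeros((r, r))  # r x r, versus (r*n) x (r*n) for a full Fisher matrix

initial_loss = 0.5 * np.sum((A - T) ** 2)
for step in range(1, 201):
    grad = A - T  # gradient of the toy loss
    A, M, F = fisher_preconditioned_step(A, grad, M, F, step)
final_loss = 0.5 * np.sum((A - T) ** 2)

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Replacing `inv_sqrt_psd(F_hat, eps) @ M_hat` with the element-wise `M_hat / (np.sqrt(V_hat) + eps)`, where `V_hat` is a hypothetical EMA of squared gradients, gives the usual Adam-style update, which makes the contrast with a purely diagonal second moment explicit.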