PROFIT: A Specialized Optimizer for Deep Fine Tuning

Anirudh S Chakravarthy, Shuai Kyle Zheng, Xin Huang, Sachithra Hemachandra, Xiao Zhang, Yuning Chai, Zhao Chen
2024-12-09
Abstract: Fine-tuning pre-trained models has become invaluable in computer vision and robotics. Recent fine-tuning approaches focus on improving efficiency rather than accuracy by using a mixture of smaller learning rates or frozen backbones. To return the spotlight to model accuracy, we present PROFIT (Proximally Restricted Optimizer For Iterative Training), one of the first optimizers specifically designed for incrementally fine-tuning converged models on new tasks or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions due to random initialization, PROFIT leverages the structure of a converged model to regularize the optimization process, leading to improved results. By employing a simple temporal gradient orthogonalization process, PROFIT outperforms traditional fine-tuning methods across various tasks: image classification, representation learning, and large-scale motion prediction. Moreover, PROFIT is encapsulated within the optimizer logic, making it easily integrated into any training pipeline with minimal engineering effort. A new class of fine-tuning optimizers like PROFIT can drive advancements as fine-tuning and incremental training become increasingly prevalent, reducing reliance on costly model training from scratch.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to improve model accuracy, rather than only efficiency, when fine-tuning pre-trained deep learning models. The authors propose a new optimizer, PROFIT (Proximally Restricted Optimizer For Iterative Training), which uses the structure of the converged model to regularize the optimization process and thereby improve fine-tuning results.

### Background of the paper and problem description

1. **Importance of fine-tuning**:
   - As datasets and models grow, training a new model from scratch for each new application or environment becomes impractical.
   - For example, each time a self-driving car enters a new city, or a camera application needs to recognize new types of objects, retraining the model is very costly.
2. **Limitations of existing fine-tuning methods**:
   - Existing fine-tuning methods mainly target efficiency rather than accuracy, for example by using a smaller learning rate or freezing the backbone network.
   - These methods improve efficiency but may reduce model performance, especially on new tasks or datasets.
3. **Catastrophic forgetting problem**:
   - During fine-tuning the model tends to forget previously learned information, a phenomenon known as "catastrophic forgetting".
   - Methods that address this problem usually require additional data engineering and changes to the model architecture.
4. **Status of optimizers**:
   - Current optimizers (such as SGD and Adam) were originally designed to train models from scratch and make few assumptions about the problem setting.
   - This makes them perform poorly in fine-tuning scenarios, because fine-tuning usually starts from an already well-trained model.

### Proposal of the PROFIT optimizer

To overcome the above problems, the authors propose the PROFIT optimizer, which has the following main features:

- **Utilizing the structure of the converged model**: PROFIT regularizes the optimization process by keeping the model state close to its initial, well-converged state.
- **Multi-task learning in the time dimension**: Fine-tuning is treated as a multi-task learning problem across time, and conflicts between the tasks are reconciled through gradient orthogonalization.
- **Simple and easy to integrate**: PROFIT is encapsulated in the optimizer logic and can be integrated into any training pipeline with minimal engineering effort.

### Experimental verification

The authors evaluate PROFIT in multiple experiments, including low-dimensional toy examples, image classification, and large-scale motion prediction. The reported results show PROFIT outperforming traditional fine-tuning methods on these tasks.

### Summary

The core question of this paper is how to improve model accuracy, rather than only efficiency, when fine-tuning pre-trained models. PROFIT addresses it by exploiting the structure of the converged model and by introducing a multi-task learning mechanism in the time dimension, and it shows superior performance across multiple tasks.
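
### Illustration: removing a conflicting gradient component

The summary above names gradient orthogonalization without giving the update rule, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It uses a generic projection step of the kind found in multi-task gradient-surgery methods such as PCGrad: when a proposed update conflicts with a second direction (negative dot product), the conflicting component is projected out. The function names `dot` and `project_conflict`, and the reading of `anchor` as "the direction back toward the converged weights", are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(update: Vector, anchor: Vector, eps: float = 1e-12) -> Vector:
    """Return `update` with its component along `anchor` removed, but only
    when the two directions conflict (negative dot product).

    Hypothetical helper: `anchor` stands in for a direction tied to the
    converged model; the paper's actual rule may differ.
    """
    overlap = dot(update, anchor)
    norm_sq = dot(anchor, anchor)
    # No conflict, or a degenerate anchor: leave the update untouched.
    if overlap >= 0.0 or norm_sq < eps:
        return list(update)
    scale = overlap / norm_sq
    # Subtract the projection of `update` onto `anchor`.
    return [u - scale * a for u, a in zip(update, anchor)]


# A conflicting update: its second coordinate opposes the anchor direction.
resolved = project_conflict([1.0, -1.0], [0.0, 1.0])
# A non-conflicting update is passed through unchanged.
unchanged = project_conflict([1.0, 1.0], [0.0, 1.0])
```

After projection, `resolved` is orthogonal to `anchor`, so the step no longer moves against that direction while keeping the rest of the update. In a real training loop the same operation would be applied to flattened parameter gradients inside the optimizer step, which is consistent with the claim that the method lives entirely in the optimizer logic.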