A survey of efficient fine-tuning methods for Vision-Language Models — Prompt and Adapter

JiaLu Xing,JianPing Liu,Jian Wang,LuLu Sun,Xi Chen,XunXun Gu,YingFei Wang
DOI: https://doi.org/10.1016/j.cag.2024.01.012
IF: 1.821
2024-01-31
Computers & Graphics
Abstract:Vision Language Model (VLM) is a popular research field located at the fusion of computer vision and natural language processing (NLP). With the emergence of transformer networks and mass web data, numerous large scale VLMs or Vision-Language Pre-training Models (VLPM) have been achieving state-of-the-art results in many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for general researchers. In recent years, Efficient fine-tuning (EFT) which a very low-cost tuning method has been a good solution to this problem has greatly alleviated this problem, and driven by this, a new fine-tuning paradigm has developed. Since Prompt and Adapter are most widely used in the field of visual language, this review focuses on analyzing the progress of the application of these two methods. Firstly, we reviewed the VLM research paradigm based on the differences in pre-training-fine-tuning methods; Next, We categorized the Prompt into 3 types (7 subtypes) of usage patterns based on the different modal information, and categorized the Adapter into 2 types of usage patterns based on whether it plays a role in modal fusion, furthmore we discussed them in vision and vision-language tasks. Finally, we discussed the stability and social ethics of EFT, and possible future research directions were proposed.
computer science, software engineering
What problem does this paper attempt to address?