Vision-Language Models for Vision Tasks: A Survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu
2024-02-16
Abstract: Most visual recognition studies rely heavily on crowd-labelled data in deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet, and a single VLM enables zero-shot predictions on various visual recognition tasks. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper primarily aims to address two major challenges in visual recognition research:

1. **Slow convergence when training Deep Neural Networks (DNNs)**: Under the traditional paradigm of training deep networks from scratch, DNNs usually take a long time to converge.
2. **Collecting large-scale, task-specific, manually labeled datasets**: Training a DNN typically requires a large, manually labeled dataset for each task, which is time-consuming and labor-intensive to build.

To address these two challenges, the paper surveys Vision-Language Models (VLMs) as an alternative paradigm. VLMs learn rich vision-language associations from the almost infinitely available image-text pairs on the internet and can make zero-shot predictions without additional fine-tuning for each visual recognition task. A single VLM can therefore serve many recognition tasks, which simplifies the recognition pipeline and reduces labelling and training effort.

Specifically, the contributions of the paper include:

- Systematically reviewing Vision-Language Models used for various visual recognition tasks (such as image classification, object detection, and semantic segmentation).
- Conducting comprehensive benchmarking and discussion of existing works.
- Identifying research challenges and potential future research directions.

Through these efforts, the paper aims to give researchers in visual recognition a clear overall picture of current achievements and potential future developments.
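
To make the idea of zero-shot prediction concrete, the following is a minimal sketch of how a contrastively pre-trained VLM is commonly used for classification: the image embedding is compared with the text embedding of each class name, and the most similar class is chosen. It is an illustration, not the survey's own code. The embeddings are hand-written toy vectors standing in for the outputs of a real image encoder and text encoder, and the names `zero_shot_classify` and `temperature` are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, class_names, temperature=0.01):
    """Pick the class whose text embedding is most similar to the image embedding.

    `temperature` is a hypothetical scaling factor for this sketch; real models
    typically learn such a value during pre-training.
    """
    img = l2_normalize(image_embedding)
    logits = []
    for text_vec in text_embeddings:
        txt = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


# Toy stand-ins for encoder outputs (assumed values, not real model embeddings).
class_names = ["cat", "dog", "car"]
text_embeddings = [
    [0.9, 0.1, 0.0],  # would come from encoding a prompt such as "a photo of a cat"
    [0.1, 0.9, 0.0],  # "a photo of a dog"
    [0.0, 0.1, 0.9],  # "a photo of a car"
]
image_embedding = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_embedding, text_embeddings, class_names)
print(label)
```

No class-specific training happens here: adding a new category only requires one more text embedding, which is why a single pre-trained model can be reused across recognition tasks.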