Vision-Language Models for Vision Tasks: A Survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu

2024-02-16

Abstract: Most visual recognition studies rely heavily on crowd-labelled data in deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper primarily aims to address two major challenges in visual recognition research:

1. **Slow convergence when training Deep Neural Networks (DNNs)**: Under the traditional paradigm of training from scratch, DNNs usually take a significant amount of time to converge.
2. **Collecting large-scale, task-specific, manually labeled datasets**: Training a DNN often requires a large, manually labeled dataset specific to the task, and building one is time-consuming and labor-intensive.

In response to these two challenges, the paper surveys Vision-Language Models (VLMs). VLMs learn rich vision-language associations from the almost infinitely available image-text pairs on the Internet and can perform zero-shot predictions without additional fine-tuning for each visual recognition task. A single VLM can therefore replace many separately trained task-specific models, which simplifies the visual recognition pipeline.

Specifically, the contributions of the paper include:

- Systematically reviewing Vision-Language Models used for various visual recognition tasks (such as image classification, object detection, and semantic segmentation).
- Conducting comprehensive benchmarking and discussion of existing works.
- Identifying research challenges and proposing future research directions.

Through these efforts, the paper aims to give researchers in visual recognition a clear overall picture of current achievements and potential future developments.
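The zero-shot prediction mentioned above can be illustrated with a minimal sketch. It assumes a contrastively pre-trained VLM that maps images and text prompts into a shared embedding space. The embeddings below are made-up toy vectors, and `zero_shot_classify`, the prompts, and the temperature value are hypothetical names and settings chosen for illustration, not anything taken from the survey. A real system would obtain the embeddings from the VLM's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.01):
    """Pick the class prompt whose embedding is most similar to the image.

    text_embeddings maps each class prompt to its (assumed) text embedding.
    Returns the best prompt and a softmax distribution over all prompts.
    """
    prompts = list(text_embeddings)
    # Scaled cosine similarities act as classification logits.
    logits = [cosine(image_embedding, text_embeddings[p]) / temperature
              for p in prompts]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {p: e / total for p, e in zip(prompts, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings (assumptions, not outputs of a real encoder).
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

No task-specific training happens here: adding a new class only requires adding a new text prompt, which is why one pre-trained model can serve many recognition tasks.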

An Introduction to Vision-Language Modeling

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

VLP: A Survey on Vision-language Pre-training

A Survey of Vision-Language Pre-Trained Models

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Vision Language Models in Autonomous Driving: A Survey and Outlook

Vision-Language Models in Remote Sensing: Current progress and future trends

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends

Towards Vision-Language Geo-Foundation Model: A Survey

Vision-Language Models under Cultural and Inclusive Considerations

Vision Language Models in Autonomous Driving and Intelligent Transportation Systems

A Survey on Vision-Language-Action Models for Embodied AI

Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

A Survey of Medical Vision-and-Language Applications and Their Techniques

HumanVLM: Foundation for Human-Scene Vision-Language Model

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

Visually-Augmented Language Modeling