Bridge the Modality and Capacity Gaps in Vision-Language Model Selection

Chao Yi, De-Chuan Zhan, Han-Jia Ye

2024-03-21

Abstract: Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs enhances the likelihood of identifying a suitable VLM for specific tasks. Thus, a promising zero-shot image classification strategy is selecting the most appropriate Pre-Trained VLM from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection: the "Modality Gap" -- the disparity in a VLM's embeddings across two different modalities, making text a less reliable substitute for images; and the "Capability Gap" -- the discrepancy between a VLM's overall ranking and its ranking for the target dataset, hindering direct prediction of a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first adopts optimal transport to capture the relevance between open-source datasets and the target dataset with a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from open-source datasets to the target dataset for bridging those two gaps and enhancing the VLM's capacity estimation for VLM selection. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.

Machine Learning, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of selecting the best pre-trained Vision-Language Model (VLM) for zero-shot image classification when only the text data of the target dataset (such as its class names) is available and its images are not. A VLM can classify images without task-specific training data by matching image embeddings against the embeddings of textual class names, but two challenges make text-only selection difficult: the Modality Gap and the Capability Gap. The Modality Gap is the difference between a VLM's embeddings of the two modalities (images and text), which makes text an unreliable substitute for images. The Capability Gap is the difference between a VLM's overall ranking and its ranking on a specific target dataset, so general performance does not directly predict dataset-specific performance. The paper proposes a method called VLM Selection With gAp Bridging (SWAB) to mitigate both gaps. SWAB first uses optimal transport to capture the relevance between open-source datasets and the target dataset in the form of a transportation matrix. It then uses this matrix to transfer useful statistics of each VLM from the open-source datasets to the target dataset, which bridges the two gaps and improves the estimate of the VLM's capability on the target dataset. Experiments across various VLMs and image classification datasets support the effectiveness of SWAB.
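To make the optimal-transport step more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that classes are compared through text embeddings of their names, that the cost is cosine distance, and that the transported "statistic" is some per-class vector measured on the open-source datasets; none of these details are stated in the summary above. It uses a generic entropy-regularized (Sinkhorn) solver, and all names (`sinkhorn`, `cosine_cost`, `src_emb`, `tgt_emb`, `src_stats`) and the random data are hypothetical.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.5, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (n_src, n_tgt) cost matrix; a, b: source/target marginals.
    Returns a transport plan with row sums a and column sums close to b.
    """
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]

def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

# Hypothetical data: random vectors stand in for real text embeddings.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(6, 16))    # 6 classes from open-source datasets
tgt_emb = rng.normal(size=(4, 16))    # 4 classes of the target dataset
src_stats = rng.normal(size=(6, 16))  # assumed per-class statistic of one VLM

a = np.full(6, 1 / 6)                 # uniform mass over source classes
b = np.full(4, 1 / 4)                 # uniform mass over target classes
plan = sinkhorn(cosine_cost(src_emb, tgt_emb), a, b)

# Normalize each column so every target class receives a convex
# combination of the source-class statistics.
weights = plan / plan.sum(axis=0, keepdims=True)
tgt_stats = weights.T @ src_stats     # estimated statistic per target class
```

In this sketch, source classes that are semantically closer to a target class (lower cost) receive more transport mass and therefore contribute more to that class's estimated statistic. The regularization strength `reg` and the uniform marginals are arbitrary illustrative choices.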

AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

Exploring Vision-Language Models for Imbalanced Learning

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

Bridging Vision and Language Spaces with Assignment Prediction

LOVM: Language-Only Vision Model Selection

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Vision-Language Models for Vision Tasks: A Survey

VLIS: Unimodal Language Models Guide Multimodal Language Generation

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

The Neglected Tails in Vision-Language Models

Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

Visually-Augmented Language Modeling

How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

Rethinking VLMs and LLMs for Image Classification