Abstract: Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.

What problem does this paper attempt to address?

This paper addresses the problem that pre-trained vision-language models (VLMs) cannot fully realize their potential when adapted to new concepts, because little information about the new classes is available. Specifically, although pre-trained VLMs perform well on many visual classification tasks, they often underperform when applied directly to zero-shot and few-shot tasks on new categories.

### Main problems

1. **Limitations in understanding new concepts**: When adapting to new concepts, VLMs are held back by the lack of detailed information about the new categories.
2. **Raw inputs are not informative enough**: When tested with only the original image and the bare category names, VLMs may fail to focus on the important regions or features, which leads to unsatisfactory recognition results.
3. **Complexity of multi-modal interaction**: It is difficult to mine the semantic correlations between the visual and text modalities, especially once each input has been augmented into multiple views.

### Solutions

To solve these problems, the paper proposes a novel adaptation framework, AWT (Augment, Weight, then Transport), which improves the adaptation ability of VLMs in three steps:

1. **Augment**:
   - **Visual augmentation**: Generate diverse image views through data augmentation techniques such as random scaling, cropping, and flipping.
   - **Text augmentation**: Use large language models (LLMs) to generate detailed category descriptions, ensuring that the descriptions are diverse and relevant to the visual content.
2. **Weight**:
   - **Dynamic weighting mechanism**: Dynamically evaluate the importance of each view based on its prediction entropy. More confident (lower-entropy) predictions usually indicate higher accuracy, so this step identifies and prioritizes the important views and adjusts the importance distribution to the specific task.
3. **Transport**:
   - **Optimal Transport**: Formulate the image-text distance calculation as an optimal transport problem that takes the importance of each augmented view into account and discovers cross-modal correlations by minimizing the transport cost:

\[ L_c(\alpha, \beta) = \min_{P \in U(a,b)} \langle C, P \rangle = \min_{P \in U(a,b)} \sum_{i,j} C_{i,j} P_{i,j} \]

where \(\alpha\) and \(\beta\) are the discrete distributions over image views and text views, \(a\) and \(b\) are their weight (importance) vectors, \(U(a,b)\) is the set of transport plans \(P\) whose row and column sums equal \(a\) and \(b\), and \(C_{i,j}\) is the cost of transporting mass from the source location \(x_i\) to the target location \(y_j\).

Through these steps, AWT improves the zero-shot performance of VLMs without additional training and supports few-shot learning through a multimodal adapter module. Experimental results show that AWT outperforms existing methods in zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization.
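
The Weight and Transport steps above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The function names `entropy_weights` and `sinkhorn`, the random features and logits, the cosine-based cost, and the temperature and regularization values are all assumptions made for illustration. Sinkhorn iterations solve an entropy-regularized approximation of the transport problem over \(U(a,b)\), which may differ from the solver the authors use.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def entropy_weights(logits, temperature=1.0):
    """Map per-view class logits (n_views, n_classes) to weights that sum to 1.

    Hypothetical helper: views with lower prediction entropy (more
    confident predictions) receive larger weights.
    """
    probs = softmax(logits, axis=-1)
    h = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    return softmax(-h / temperature, axis=0)


def sinkhorn(C, a, b, reg=0.5, n_iters=500):
    """Entropy-regularized approximation of min_{P in U(a,b)} <C, P>.

    C: (M, N) cost matrix; a: (M,) and b: (N,) marginals, each summing to 1.
    Returns the transport plan P and the transport cost sum_ij C_ij * P_ij.
    """
    K = np.exp(-C / reg)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)          # rescale to match column marginals
        u = a / (K @ v)            # rescale to match row marginals
    P = u[:, None] * K * v[None, :]
    return P, float(np.sum(P * C))


def unit_rows(x):
    """L2-normalize each row so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Toy data (assumed): 4 augmented image views, 3 descriptions of one class.
rng = np.random.default_rng(0)
X = unit_rows(rng.normal(size=(4, 8)))    # image-view features x_i
Y = unit_rows(rng.normal(size=(3, 8)))    # text-view features y_j
view_logits = rng.normal(size=(4, 5))     # stand-in logits per image view
desc_logits = rng.normal(size=(3, 5))     # stand-in logits per description

a = entropy_weights(view_logits)          # importance of image views
b = entropy_weights(desc_logits)          # importance of text views
C = 1.0 - X @ Y.T                         # cosine distance as transport cost
P, dist = sinkhorn(C, a, b)

print("image-view weights:", np.round(a, 3))
print("text-view weights: ", np.round(b, 3))
print("image-text transport distance:", round(dist, 4))
```

In a classification setting, one would compute such a distance between the image's views and each candidate class's descriptions and predict the class with the smallest distance. A smaller `reg` brings the result closer to the exact optimal transport cost but needs more iterations to converge.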

AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

A-VL: Adaptive Attention for Large Vision-Language Models

Bridge the Modality and Capacity Gaps in Vision-Language Model Selection

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Bridging Vision and Language Spaces with Assignment Prediction

Exploring Vision-Language Models for Imbalanced Learning

Visually-Augmented Language Modeling

Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

Efficient Transfer Learning for Video-language Foundation Models

VILA$^2$: VILA Augmented VILA

Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning

OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Test-time Alignment-Enhanced Adapter for Vision-Language Models

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models