AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, Limin Wang
2024-10-06
Abstract: Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem that pre-trained vision-language models (VLMs) do not reach their full potential when adapted to new concepts, because little information about the new classes is available. Although existing VLMs perform well on many visual classification tasks, their accuracy often drops when they are applied directly to zero-shot and few-shot recognition of new classes.

### Main problems

1. **Limitations in understanding new concepts**: When adapting to new concepts, VLM performance is limited by the lack of detailed information about the new classes.
2. **Uninformative inputs**: When tested with only the raw image and the class name, VLMs cannot focus on specific important regions or features, which leads to unsatisfactory recognition.
3. **Complexity of multimodal interaction**: It is difficult to capture the semantic correlations between the visual and text modalities, especially once each input has been augmented into multiple views.

### Solutions

To address these problems, the paper proposes a novel adaptation framework, AWT (Augment, Weight, then Transport), which improves the adaptation of VLMs in three steps:

1. **Augment**:
   - **Visual augmentation**: Generate diverse image views through data augmentation techniques such as random resizing, cropping, and flipping.
   - **Text augmentation**: Use large language models (LLMs) to generate detailed class descriptions that are both diverse and relevant to the visual content.
2. **Weight**:
   - **Dynamic weighting mechanism**: Score the importance of each view by its prediction entropy. Lower entropy means a more confident prediction, which usually indicates higher accuracy. This step identifies and prioritizes the important views and adjusts the importance distribution to the specific task.
3. **Transport**:
   - **Optimal transport**: Formulate the image-text distance as an optimal transport problem that takes the importance of each augmented view into account and discovers cross-modal correlations by minimizing the transport cost:

     \[ L_c(\alpha, \beta) = \min_{P \in U(a,b)} \langle C, P \rangle = \min_{P \in U(a,b)} \sum_{i,j} C_{i,j} P_{i,j} \]

     where \(\alpha\) and \(\beta\) are the discrete distributions over image views and text views, \(U(a,b)\) is the set of transport plans \(P\) whose row and column sums equal the importance weights \(a\) and \(b\), and \(C_{i,j}\) is the cost of transporting mass from the source location \(x_i\) to the target location \(y_j\).

With these steps, AWT improves the performance of VLMs on zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. Zero-shot use requires no additional training, while few-shot learning relies on an integrated multimodal adapter module. The paper reports that AWT outperforms existing state-of-the-art methods in each of these settings.
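
The Weight and Transport steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the `temperature` and `reg` parameters, and the toy numbers are hypothetical. The weighting shown here is a simplified reading of "weight by prediction entropy", and the transport plan is approximated with Sinkhorn iterations on an entropy-regularised problem, which is a swapped-in technique and may differ from the solver used in the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weights(view_logits, temperature=0.5):
    """Turn per-view class logits into importance weights that sum to 1.

    A view with lower prediction entropy (more confident) gets a larger
    weight. `temperature` is a hypothetical knob, not a value from the paper.
    """
    ents = [entropy(softmax(logits)) for logits in view_logits]
    return softmax([-h / temperature for h in ents])


def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    """Approximate the optimal transport plan with Sinkhorn iterations.

    This solves the entropy-regularised problem, so the result only
    approximates the exact minimiser of <C, P> over U(a, b).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals a and b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_distance(cost, a, b, reg=0.1, n_iters=200):
    """Transport cost <C, P> under the approximate optimal plan P."""
    plan = sinkhorn_plan(cost, a, b, reg, n_iters)
    return sum(cost[i][j] * plan[i][j]
               for i in range(len(a)) for j in range(len(b)))


if __name__ == "__main__":
    # Toy data: 2 image views and 2 text descriptions scored over 3 classes.
    image_view_logits = [[4.0, 0.5, 0.2], [1.0, 0.9, 0.8]]  # first view is confident
    text_view_logits = [[3.0, 0.1, 0.1], [2.5, 0.3, 0.2]]
    a = entropy_weights(image_view_logits)
    b = entropy_weights(text_view_logits)
    # Hypothetical cost matrix, e.g. 1 - cosine similarity between embeddings.
    cost = [[0.2, 0.4], [0.7, 0.6]]
    print("image weights:", a)
    print("text weights:", b)
    print("distance:", transport_distance(cost, a, b))
```

In a classification setting, one such distance would be computed between the image views and the descriptions of each candidate class, and the class with the smallest distance would be predicted. An exact linear-programming solver could replace the Sinkhorn loop when the number of views is small.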