Abstract: The choice of input text prompt plays a critical role in the performance of Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with vision and language encoders to strengthen the alignment between the two modalities. We enforce consistency between the respective encoder branches (receiving augmented inputs) to prevent overfitting in downstream tasks. Our method is evaluated on three representative tasks: generalization to novel classes, cross-dataset evaluation, and unseen domain shifts. In practice, APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.

What problem does this paper attempt to address?

The main problem this paper addresses is **improving the generalization ability of Vision-Language Pretrained (VLP) models in few-shot fine-tuning settings**. Specifically, the authors aim to improve how existing VLP models (such as CLIP) handle novel classes, cross-dataset evaluation, and unseen domain shifts by combining adapter learning with prompt learning.

### Analysis of the Main Problem

1. **Generalization Problem in Few-Shot Fine-Tuning**:
   - Although existing VLP models (such as CLIP) have strong zero-shot generalization ability after large-scale pre-training, their performance is often unsatisfactory after few-shot fine-tuning. This is mainly because the models are large while the available training data is scarce.
2. **Multi-Modal Alignment and Consistency**:
   - Improving generalization across tasks requires stronger alignment and consistency between the image and text encoders. Existing methods usually focus on either unimodal prompt learning or adapter tuning alone, ignoring the synergy between the two.
3. **Preventing Overfitting**:
   - In downstream tasks, the model is prone to overfitting to a specific data distribution, which reduces its generalization ability. Effective regularization strategies are therefore needed.

### Solutions

To address these problems, the authors propose **APoLLo** (Unified Adapter and Prompt Learning for Vision-Language Models), a unified multi-modal adapter and prompt learning method. Specific measures include:

1. **Introducing Trainable Cross-Attention Adapter Layers**:
   - Add trainable cross-attention adapter layers to the vision and language encoders to strengthen the alignment between the two modalities.
2. **Multi-Modal Input Augmentation**:
   - Use a pre-trained language model to generate descriptive text as augmented samples for the text branch, and use a text-conditioned diffusion model to generate augmented image samples, thereby further regularizing the model.
3. **Contrastive Consistency Loss**:
   - Introduce a contrastive consistency loss that enforces agreement between the encoder branches receiving the original and the augmented inputs, which helps prevent overfitting.
4. **Cross-Modal Similarity Maximization**:
   - Further strengthen multi-modal feature learning by maximizing the similarity between matching image-text pairs.

### Experimental Results

The experimental results show that APoLLo outperforms existing state-of-the-art methods (such as MaPLe) on multiple benchmark datasets across the three evaluated tasks: generalization to novel classes, cross-dataset evaluation, and unseen domain shifts. Specifically, APoLLo achieves a relative gain of up to 6.03% over MaPLe on novel classes across 10 diverse image recognition datasets.

In conclusion, the paper addresses the generalization problem of VLP models in few-shot fine-tuning settings by proposing the APoLLo framework, which improves model performance on these downstream tasks.
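
To make the contrastive consistency loss from the Solutions list more concrete, here is a minimal sketch in plain Python. It assumes a symmetric InfoNCE-style objective between embeddings from the original branch and the augmented branch; the exact formulation used in APoLLo is not given in this summary, so the function names, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions rather than the authors' implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i] among all positives.

    Assumption: a standard InfoNCE form, not necessarily the loss used in the paper.
    """
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        total += log_denominator - logits[i]
    return total / len(anchors)


def consistency_loss(original, augmented, temperature=0.07):
    """Symmetric loss: original -> augmented and augmented -> original."""
    return 0.5 * (
        info_nce(original, augmented, temperature)
        + info_nce(augmented, original, temperature)
    )


# Hypothetical toy embeddings: three samples seen by the original branch,
# and slightly perturbed versions standing in for the augmented branch.
original = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Same augmented embeddings, but paired with the wrong samples.
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = consistency_loss(original, aligned)
loss_shuffled = consistency_loss(original, shuffled)
print("aligned pairs:", loss_aligned)
print("shuffled pairs:", loss_shuffled)
```

The loss is small when each augmented embedding stays close to its own original and large when the pairing is broken, which is the behavior a consistency regularizer of this kind relies on. In a real setup the embeddings would come from the image and text encoders rather than hand-written lists.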

APoLLo: Unified Adapter and Prompt Learning for Vision Language Models

APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

MaPLe: Multi-modal Prompt Learning

Multi-modal Attribute Prompting for Vision-Language Models

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Adversarial Prompt Distillation for Vision-Language Models

MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

DPL: Decoupled Prompt Learning for Vision-Language Models

Unsupervised Prompt Learning for Vision-Language Models

COMMA: Co-Articulated Multi-Modal Learning

Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

Multi-Modal Adapter for Vision-Language Models

MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality

PILL: Plug Into LLM with Adapter Expert and Attention Gate

Concept-Guided Prompt Learning for Generalization in Vision-Language Models

ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts

LaViP: Language-Grounded Visual Prompts