Abstract: Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities. Specifically, we create and evolve two sets of prototypes--textual and visual--to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency. Code is available at https://github.com/zhangce01/DPE-CLIP.

What problem does this paper attempt to address?

The main problem this paper addresses is how vision-language models (VLMs) generalize at test time, in particular how to adapt to new, unseen distributions using only unlabeled test samples. Specifically, the paper targets two limitations of existing methods:

1. **Accumulation problem**: Existing test-time adaptation methods (such as test-time prompt tuning, TPT) usually handle each test sample independently and cannot accumulate task-specific knowledge from previous test samples. As a result, model performance does not improve as more samples are seen.
2. **Multimodal problem**: Most existing methods adapt only a single modality (text or vision) and fail to use textual and visual information together to improve generalization.

To solve these problems, the paper proposes a new method named **Dual Prototype Evolving (DPE)**. DPE progressively captures more accurate multimodal representations by evolving textual and visual prototypes simultaneously, and improves zero-shot generalization by introducing learnable residual parameters that align the prototypes of the two modalities.

### Specific contributions

- **Dual Prototype Evolving**: DPE maintains two sets of prototypes, textual and visual, to progressively capture multimodal representations of the target classes.
- **Aligning Multimodal Prototypes**: To promote consistent multimodal representations, DPE introduces and optimizes learnable residual parameters for each test sample so that the textual and visual prototypes stay aligned.
- **Experimental Verification**: Extensive experiments on 15 benchmark datasets show that DPE consistently outperforms previous state-of-the-art methods while remaining competitive in computational efficiency.

### Method Overview

The core idea of DPE is to accumulate task-specific knowledge progressively by updating textual and visual prototypes online. The specific steps are as follows:

1. **Text Prototype Evolution**: Update the textual prototypes online by cumulative averaging, filtering out low-confidence samples to keep the online updates stable.
2. **Visual Prototype Evolution**: Use a priority queue strategy to store high-confidence image features and dynamically update the visual prototypes from them.
3. **Prototype Residual Learning**: Introduce learnable residual parameters for the multimodal prototypes and optimize them by minimizing the entropy of the prediction while enforcing multimodal alignment.

Through these components, DPE adapts to new distributions at test time and improves the generalization of vision-language models without requiring labels for the test samples.
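The first two steps can be illustrated with a minimal sketch in plain Python. It is not taken from the DPE code base: the class names `TextPrototype` and `VisualQueue`, the use of prediction entropy as the confidence measure, the `max_entropy` threshold, and the choice to count the initial prototype as the first sample are all assumptions made for illustration. The `entropy` helper computes the quantity that step 3 would minimize; the gradient-based residual optimization itself is left out because it needs an autograd framework.

```python
import heapq
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class TextPrototype:
    """Cumulative average of accepted vectors for one class (hypothetical helper)."""

    def __init__(self, initial):
        self.mean = list(initial)
        self.count = 1  # assumption: the initial prototype counts as one sample

    def update(self, vector, sample_entropy, max_entropy):
        # Assumed filter: skip low-confidence (high-entropy) samples.
        if sample_entropy > max_entropy:
            return False
        self.count += 1
        # Incremental form of the running mean.
        self.mean = [m + (v - m) / self.count for m, v in zip(self.mean, vector)]
        return True


class VisualQueue:
    """Fixed-capacity queue keeping the lowest-entropy features for one class."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Entries are (-entropy, insertion_id, feature), so the heap root is
        # always the least confident feature currently stored.
        self._heap = []
        self._next_id = 0

    def push(self, feature, sample_entropy):
        entry = (-sample_entropy, self._next_id, list(feature))
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif sample_entropy < -self._heap[0][0]:
            # More confident than the worst stored feature: replace it.
            heapq.heapreplace(self._heap, entry)

    def prototype(self):
        """Mean of the stored features, or None if the queue is empty."""
        if not self._heap:
            return None
        feats = [e[2] for e in self._heap]
        return [sum(col) / len(feats) for col in zip(*feats)]
```

In this sketch one `TextPrototype` and one `VisualQueue` would be kept per class, and each incoming test sample would update only the entry for its predicted class. Storing negated entropies lets Python's min-heap expose the least confident stored feature at the root, so an eviction costs O(log n).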

Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
