Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models

Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie
2024-10-17
Abstract: Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities. Specifically, we create and evolve two sets of prototypes--textual and visual--to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency. Code is available at https://github.com/zhangce01/DPE-CLIP.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the test-time generalization of vision-language models (VLMs): how to adapt a pre-trained model to new, unseen distributions using only unlabeled test samples. It targets two limitations of existing methods:

1. **Accumulation problem**: Existing test-time adaptation methods, such as test-time prompt tuning (TPT), usually handle each test sample independently and cannot accumulate task-specific knowledge from earlier test samples. As a result, model performance does not improve as more samples are processed.
2. **Multimodal problem**: Most existing methods adapt only a single modality (text or vision) and therefore do not exploit textual and visual information jointly to improve generalization.

To address these limitations, the paper proposes **Dual Prototype Evolving (DPE)**. DPE progressively captures more accurate multimodal representations by evolving textual and visual prototypes together, and it improves zero-shot generalization by introducing learnable residual parameters that align the prototypes of the two modalities.

### Specific contributions

- **Dual Prototype Evolving**: DPE maintains two sets of prototypes, textual and visual, to progressively capture multimodal representations of the target classes.
- **Aligning Multimodal Prototypes**: To promote consistent multimodal representations, DPE introduces and optimizes learnable residual parameters for each test sample so that the textual and visual prototypes stay aligned.
- **Experimental Verification**: Experiments on 15 benchmark datasets show that DPE consistently outperforms previous state-of-the-art methods while remaining competitive in computational efficiency.

### Method Overview

The core idea of DPE is to accumulate task-specific knowledge gradually by updating the textual and visual prototypes online. The specific steps are as follows:

1. **Text Prototype Evolution**: Update the text prototypes online by cumulative averaging, skipping low-confidence samples so that the online updates remain stable.
2. **Visual Prototype Evolution**: Use a priority queue to store high-confidence image features and update the visual prototypes dynamically from the queue contents.
3. **Prototype Residual Learning**: Introduce learnable residual parameters and optimize the multimodal prototypes by minimizing the entropy of the prediction while enforcing multimodal alignment.

Together, these components let DPE adapt to new distributions at test time and improve the generalization of vision-language models without requiring labeled data.
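The two prototype-update steps in the method overview can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class names `TextPrototype` and `VisualPrototype`, the entropy threshold `max_entropy`, the queue `capacity`, and the use of prediction entropy as the confidence measure are all assumptions made for illustration. The residual-learning step is omitted because it requires gradient-based optimization.

```python
import heapq
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log); lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class TextPrototype:
    """Cumulative (running) average of accepted prototype vectors for one class."""

    def __init__(self, initial):
        self.mean = list(initial)
        self.count = 1  # the initial prototype counts as the first sample

    def update(self, candidate, sample_entropy, max_entropy):
        # Assumption: samples whose entropy exceeds the threshold are skipped.
        if sample_entropy > max_entropy:
            return False
        self.count += 1
        self.mean = [m + (c - m) / self.count
                     for m, c in zip(self.mean, candidate)]
        return True


class VisualPrototype:
    """Bounded priority queue keeping the lowest-entropy image features of a class."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []    # entries: (-entropy, insertion_index, feature)
        self._counter = 0  # tie-breaker so feature lists are never compared

    def push(self, feature, sample_entropy):
        entry = (-sample_entropy, self._counter, list(feature))
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entry[0] > self._heap[0][0]:
            # The heap root is the least confident stored feature; replace it
            # only when the new feature is more confident.
            heapq.heapreplace(self._heap, entry)

    def prototype(self):
        """Mean of the stored features, or None if the queue is empty."""
        if not self._heap:
            return None
        n = len(self._heap)
        dim = len(self._heap[0][2])
        return [sum(e[2][i] for e in self._heap) / n for i in range(dim)]
```

In this sketch, each test sample would first be scored with `entropy(softmax(logits))`; that value then gates the text-prototype update and ranks the image feature in the queue of the predicted class. Storing negated entropies in a min-heap keeps the least confident feature at the root, so eviction takes logarithmic time in the queue size.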