BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping

Taolin Zhang, Jinpeng Wang, Hang Guo, Tao Dai, Bin Chen, Shu-Tao Xia
2024-10-24
Abstract: Adaptation of pretrained vision-language models such as CLIP to various downstream tasks has raised great interest in recent research. Previous works have proposed a variety of test-time adaptation (TTA) methods to achieve strong generalization without any knowledge of the target domain. However, existing training-required TTA approaches like TPT necessitate entropy minimization that involves large computational overhead, while training-free methods like TDA overlook the potential for information mining from the test samples themselves. In this paper, we break down the design of existing popular training-required and training-free TTA methods and bridge the gap between them within our framework. Specifically, we maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples. The historical samples are filtered from the testing data stream and serve to extract useful information from the target distribution, while the boosting samples are drawn from regional bootstrapping and capture the knowledge of the test sample itself. We theoretically justify the rationality behind our method and empirically verify its effectiveness on both the out-of-distribution and the cross-domain datasets, showcasing its applicability in real-world situations.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to improve the generalization of pre-trained vision-language models (such as CLIP) on downstream tasks without access to target-domain data. Specifically, it aims to mitigate domain shift and improve performance on out-of-distribution and cross-domain datasets by introducing a new test-time adaptation (TTA) method, BoostAdapter.

### Problem Background

Existing TTA methods fall into two main categories:

1. **Training-required TTA methods**: Methods such as TPT adjust model parameters or learn prompts through self-supervised objectives (such as entropy minimization) to increase the confidence of model predictions. However, they incur high computational overhead and are poorly suited to settings with limited compute.
2. **Training-free TTA methods**: Methods such as TDA use memory networks, caches, or prototype storage to hold information about target samples and distributions, and use it to adaptively modify model predictions. However, they only consider interactions with other historical samples and do not fully exploit the information in the test sample itself, which leads to poor performance on tasks that require fine-grained information.

### Research Motivation

The authors raise three key questions:

1. What is the connection between training-required methods (such as TPT) and training-free methods (such as TDA)?
2. How can the advantages of the two be combined?
3. Can vision-language models benefit from this combination?

### Solution

To address these questions, the authors propose BoostAdapter, a new test-time adaptation strategy. Its main innovations are:

1. **Regional Bootstrapping**: Augmentations such as random cropping and horizontal flipping are applied to each test sample to generate high-quality augmented samples (boosting samples) that lie closer to the target cluster.
2. **Combining historical samples and augmented samples**: BoostAdapter maintains a lightweight memory bank that holds historical samples filtered from the test data stream together with augmented samples generated through regional bootstrapping. Historical samples extract useful information about the target distribution, while augmented samples capture the characteristics of the test sample itself.
3. **Theoretical analysis and experimental verification**: The authors justify BoostAdapter through theoretical derivation and verify it experimentally on multiple benchmark datasets, demonstrating its applicability and strong performance in real-world scenarios.

### Main Contributions

1. **Establishing connections**: The relationship between training-required and training-free TTA methods is discussed for the first time, and a connection between them is established.
2. **Proposing a new method**: BoostAdapter improves the training-free adapter by introducing augmented samples.
3. **Theoretical derivation**: A target-domain error bound for BoostAdapter is derived, showing how self-bootstrapped data contributes to its performance.
4. **Experimental verification**: Extensive experiments show that BoostAdapter performs strongly in the test-time adaptation setting.

Through these contributions, BoostAdapter improves the generalization of vision-language models on unseen distributions and offers a new way to combine training-required and training-free methods.
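
The regional bootstrapping and memory-bank retrieval described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. All function names are hypothetical. Selecting views by lowest prediction entropy, the exponential affinity `exp(-beta * (1 - similarity))` (a form used by Tip-Adapter-style caches), and the values of `alpha`, `beta`, and `scale` are assumptions, as is the premise that all feature vectors are L2-normalized. The image encoder that would turn each cropped view into a feature vector is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(probs, axis=-1):
    return -(probs * np.log(probs + 1e-12)).sum(axis=axis)

def regional_views(image, n_views, rng, min_scale=0.5):
    """Random crops, each flipped horizontally with probability 0.5 (H x W x C input)."""
    h, w = image.shape[:2]
    views = []
    for _ in range(n_views):
        ch = int(rng.integers(int(h * min_scale), h + 1))  # crop height
        cw = int(rng.integers(int(w * min_scale), w + 1))  # crop width
        top = int(rng.integers(0, h - ch + 1))
        left = int(rng.integers(0, w - cw + 1))
        crop = image[top:top + ch, left:left + cw]
        if rng.random() < 0.5:
            crop = crop[:, ::-1]  # horizontal flip
        views.append(crop)
    return views

def select_boosting_samples(view_feats, text_feats, k, scale=100.0):
    """Assumed filter: keep the k views with the lowest-entropy zero-shot prediction."""
    probs = softmax(scale * view_feats @ text_feats.T)
    keep = np.argsort(entropy(probs))[:k]
    return view_feats[keep], probs[keep].argmax(axis=1)  # keys and pseudo-labels

def cache_logits(query, keys, labels, n_classes, alpha=1.0, beta=5.5):
    """Key-value retrieval: affinity-weighted vote over cached one-hot labels."""
    values = np.eye(n_classes)[labels]
    affinity = np.exp(-beta * (1.0 - query @ keys.T))
    return alpha * affinity @ values

def boost_predict(query, text_feats, hist_keys, hist_labels,
                  boost_keys, boost_labels, scale=100.0):
    """Zero-shot logits plus retrieval over historical and boosting samples."""
    keys = np.concatenate([hist_keys, boost_keys])
    labels = np.concatenate([hist_labels, boost_labels])
    clip_logits = scale * query @ text_feats.T
    return clip_logits + cache_logits(query, keys, labels, text_feats.shape[0])
```

In this sketch the historical entries would be accumulated across the test stream, while the boosting entries are rebuilt for each test image from its own views, so the cache mixes instance-agnostic and instance-aware information as the summary describes.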