Generalization Boosted Adapter for Open-Vocabulary Segmentation

Wenhao Xu, Changwei Wang, Xuxiang Feng, Rongtao Xu, Longzhao Huang, Zherui Zhang, Li Guo, Shibiao Xu
2024-09-13
Abstract: Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object recognition capabilities, motivating their adaptation for dense prediction tasks like segmentation. However, directly applying VLMs to such tasks remains challenging due to their lack of pixel-level granularity and the limited data available for fine-tuning, leading to overfitting and poor generalization. To address these limitations, we propose Generalization Boosted Adapter (GBA), a novel adapter strategy that enhances the generalization and robustness of VLMs for open-vocabulary segmentation. GBA comprises two core components: (1) a Style Diversification Adapter (SDA) that decouples features into amplitude and phase components, operating solely on the amplitude to enrich the feature space representation while preserving semantic consistency; and (2) a Correlation Constraint Adapter (CCA) that employs cross-attention to establish tighter semantic associations between text categories and target regions, suppressing irrelevant low-frequency "noise" information and avoiding erroneous associations. Through the synergistic effect of the shallow SDA and the deep CCA, GBA effectively alleviates overfitting issues and enhances the semantic relevance of feature representations. As a simple, efficient, and plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods, demonstrating broad applicability and achieving state-of-the-art performance on multiple open-vocabulary segmentation benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to apply vision-language models (VLMs) effectively to open-vocabulary image segmentation while avoiding the overfitting and weak generalization that existing methods show when fine-tuning data is limited.

### Problem Background

Vision-language models such as CLIP acquire strong open-vocabulary object recognition through cross-modal contrastive learning, but applying them directly to dense prediction tasks such as segmentation raises three challenges:

1. **Lack of Pixel-level Granularity**: These models are pre-trained with image-level supervision and therefore do not capture pixel-level detail.
2. **Limited Fine-tuning Data**: The datasets available for fine-tuning are small, so the model overfits easily and generalizes poorly.
3. **Interference of Irrelevant Information**: The training data contains a large amount of "noise" that is unrelated to semantic categories, such as background textures and object styles. Direct fine-tuning can lead the model to attend to this information and learn spurious associations.

### Solution

To address these problems, the authors propose an adapter strategy named Generalization Boosted Adapter (GBA). GBA consists of two core components:

1. **Style Diversification Adapter (SDA)**:
   - **Function**: Decompose features into amplitude and phase components with a Fourier transform, and operate only on the amplitude to enrich the feature space representation while maintaining semantic consistency.
   - **Mechanism**: SDA computes the frequency-domain features of the input samples, normalizes their styles, and fuses them to generate samples with new styles. This increases feature diversity while keeping the content unchanged.
   - **Formulas**:
     \[ a = \sqrt{F(x)_{\text{real}}^2 + F(x)_{\text{img}}^2} \]
     \[ p = \arctan\left(\frac{F(x)_{\text{img}}}{F(x)_{\text{real}}}\right) \]
     \[ \mu = W \cdot \mu_{\text{base}}, \quad \sigma = W \cdot \sigma_{\text{base}} \]
     \[ a_{\text{new}} = \sigma \cdot a + \mu \]
     \[ \tilde{x} = \text{IFFT}(\text{Compose}(a_{\text{new}}, p)) \]
2. **Correlation Constraint Adapter (CCA)**:
   - **Function**: Establish a closer semantic association between text categories and target regions through cross-attention, suppress irrelevant low-frequency "noise" information, and avoid wrong associations.
   - **Mechanism**: CCA exploits the strong correlation between high-frequency components (such as object edges and contours) and semantic features to guide the model toward semantically relevant high-frequency information, which improves the accuracy of category matching.
   - **Formulas**:
     \[ \text{Attn}(Q_z, K, V) = \text{softmax}\left(\frac{Q_z K^\top}{\sqrt{d_k}}\right)V \]
     \[ Q_z = \phi_q(X_z^v), \quad K = \phi_k(X_t), \quad V = \phi_v(X_t) \]

### Summary

By combining the shallow SDA with the deep CCA, GBA alleviates overfitting and strengthens the semantic relevance of feature representations, which improves open-vocabulary segmentation performance; the abstract reports state-of-the-art results on multiple benchmarks. The method is simple, efficient, and easy to integrate into CLIP-based pipelines.
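
The SDA formulas above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `amplitude_stats` and `style_diversify`, the `(C, H, W)` feature layout, the shapes of `mu_base`, `sigma_base`, and `w`, and the per-channel normalization of the amplitude before re-styling are all assumptions made for illustration. The summary mentions a normalization step without giving its exact form.

```python
import numpy as np

EPS = 1e-6  # guards against division by zero when normalizing


def amplitude_stats(x):
    """Per-channel mean and std of the Fourier amplitude of x, shape (C, H, W)."""
    a = np.abs(np.fft.fft2(x, axes=(-2, -1)))
    return a.mean(axis=(-2, -1)), a.std(axis=(-2, -1))


def style_diversify(x, mu_base, sigma_base, w):
    """Re-style a feature map by editing only its Fourier amplitude.

    x          : (C, H, W) real-valued feature map
    mu_base    : (K, C) hypothetical bank of base style means
    sigma_base : (K, C) hypothetical bank of base style stds
    w          : (K,) mixing weights over the K base styles
    """
    f = np.fft.fft2(x, axes=(-2, -1))
    a = np.abs(f)    # amplitude: sqrt(real^2 + imag^2)
    p = np.angle(f)  # phase: quadrant-aware arctan(imag / real)

    # Assumed normalization: remove the sample's own amplitude statistics.
    a_mean = a.mean(axis=(-2, -1), keepdims=True)
    a_std = a.std(axis=(-2, -1), keepdims=True)
    a_norm = (a - a_mean) / (a_std + EPS)

    # mu = W . mu_base and sigma = W . sigma_base, broadcast over H and W.
    mu = (w @ mu_base)[:, None, None]
    sigma = (w @ sigma_base)[:, None, None]
    a_new = sigma * a_norm + mu

    # Recompose with the untouched phase and return to the spatial domain.
    x_new = np.fft.ifft2(a_new * np.exp(1j * p), axes=(-2, -1))
    return x_new.real
```

Because the phase is left untouched, feeding a sample's own amplitude statistics back in as the single base style reproduces the input up to numerical error, which makes a convenient sanity check for the decomposition.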
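
The CCA attention formula above can likewise be sketched in NumPy. The sketch assumes that the projections φ_q, φ_k, and φ_v are plain linear maps (the matrices `w_q`, `w_k`, `w_v`), that visual features are flattened to `(N, D)` and text embeddings to `(T, D)`, and that a single attention head is used. The function name `cross_attention` is hypothetical, and none of these choices is confirmed by the source.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(x_v, x_t, w_q, w_k, w_v):
    """Single-head cross-attention from visual queries to text keys/values.

    x_v : (N, D) visual features (one row per spatial location)
    x_t : (T, D) text category embeddings
    w_q, w_k : (D, d_k) assumed linear projections for queries and keys
    w_v : (D, d_v) assumed linear projection for values
    """
    q = x_v @ w_q  # Q_z = phi_q(X_z^v)
    k = x_t @ w_k  # K   = phi_k(X_t)
    v = x_t @ w_v  # V   = phi_v(X_t)

    # Scaled dot-product scores, normalized over the T text categories.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return attn @ v, attn
```

Each row of the returned `attn` matrix is a distribution over text categories for one spatial location, so every output feature is a convex combination of the projected text embeddings.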