Abstract: Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object recognition capabilities, motivating their adaptation for dense prediction tasks like segmentation. However, directly applying VLMs to such tasks remains challenging due to their lack of pixel-level granularity and the limited data available for fine-tuning, leading to overfitting and poor generalization. To address these limitations, we propose Generalization Boosted Adapter (GBA), a novel adapter strategy that enhances the generalization and robustness of VLMs for open-vocabulary segmentation. GBA comprises two core components: (1) a Style Diversification Adapter (SDA) that decouples features into amplitude and phase components, operating solely on the amplitude to enrich the feature space representation while preserving semantic consistency; and (2) a Correlation Constraint Adapter (CCA) that employs cross-attention to establish tighter semantic associations between text categories and target regions, suppressing irrelevant low-frequency "noise" information and avoiding erroneous associations. Through the synergistic effect of the shallow SDA and the deep CCA, GBA effectively alleviates overfitting issues and enhances the semantic relevance of feature representations. As a simple, efficient, and plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods, demonstrating broad applicability and achieving state-of-the-art performance on multiple open-vocabulary segmentation benchmarks.

What problem does this paper attempt to address?

The problem this paper addresses is how to apply vision-language models (VLMs) effectively to open-vocabulary image segmentation while overcoming the overfitting and weak generalization that existing methods show when fine-tuning data is limited.

### Problem Background

Vision-language models (such as CLIP) have demonstrated strong open-vocabulary object recognition capabilities through cross-modal contrastive learning, but applying them directly to dense prediction tasks (such as segmentation) raises three challenges:

1. **Lack of Pixel-level Granularity**: These models are pre-trained with image-level supervision signals and cannot capture pixel-level detail.
2. **Limited Fine-tuning Data**: Because the datasets used for fine-tuning are small, the model overfits easily, which hurts its generalization ability.
3. **Interference of Irrelevant Information**: The training data contains a large amount of "noise" information (such as background textures and object styles) that is irrelevant to semantic categories. Direct fine-tuning may cause the model to pay too much attention to this irrelevant information and establish wrong associations.

### Solution

To solve the above problems, the authors propose a new adapter strategy named Generalization Boosted Adapter (GBA). GBA consists of two core components:

1. **Style Diversification Adapter (SDA)**:
   - **Function**: Decompose features into amplitude and phase components through the Fourier transform, and operate only on the amplitude to enrich the feature space representation while maintaining semantic consistency.
   - **Mechanism**: SDA computes the frequency-domain features of the input samples, then normalizes and fuses their styles to generate samples with new styles, thereby enhancing feature diversity while keeping the content consistent.
   - **Formulas** (here \(F(x)_{\text{real}}\) and \(F(x)_{\text{imag}}\) are the real and imaginary parts of the Fourier transform of the feature \(x\)):
     \[ a = \sqrt{F(x)_{\text{real}}^2 + F(x)_{\text{imag}}^2} \]
     \[ p = \arctan\left(\frac{F(x)_{\text{imag}}}{F(x)_{\text{real}}}\right) \]
     \[ \mu = W \cdot \mu_{\text{base}}, \quad \sigma = W \cdot \sigma_{\text{base}} \]
     \[ a_{\text{new}} = \sigma \cdot a + \mu \]
     \[ \tilde{x} = \text{IFFT}(\text{Compose}(a_{\text{new}}, p)) \]
2. **Correlation Constraint Adapter (CCA)**:
   - **Function**: Establish a closer semantic association between text categories and target regions through the cross-attention mechanism, suppress irrelevant low-frequency "noise" information, and avoid wrong associations.
   - **Mechanism**: CCA exploits the strong correlation between high-frequency components (such as object edges and contours) and semantic features to guide the model toward semantically relevant high-frequency information, thereby improving the accuracy of category matching.
   - **Formulas**:
     \[ \text{Attn}(Q_z, K, V) = \text{softmax}\left(\frac{Q_z K^T}{\sqrt{d_k}}\right)V \]
     \[ Q_z = \phi_q(X_z^v), \quad K = \phi_k(X_t), \quad V = \phi_v(X_t) \]

### Summary

By combining the shallow SDA and the deep CCA, GBA alleviates the overfitting problem and enhances the semantic relevance of feature representations, which significantly improves performance on the open-vocabulary segmentation task. The method is simple, efficient, and easy to integrate.
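
The SDA formulas above can be illustrated on a toy 1-D signal. The following is a minimal sketch rather than the paper's implementation: it uses a naive DFT from the Python standard library instead of a 2-D FFT over feature maps, and it treats `sigma` and `mu` as plain scalar inputs, whereas the summary derives them from learned weights (`W * sigma_base`, `W * mu_base`). The function names `dft`, `idft`, and `restyle` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def restyle(x, sigma, mu):
    """Rescale the amplitude spectrum of a real signal x and keep its phase.

    Assumption of this sketch: sigma and mu are scalars supplied by the
    caller, standing in for the learned statistics in the summary.
    """
    spectrum = dft(x)
    amplitude = [abs(c) for c in spectrum]        # a = sqrt(real^2 + imag^2)
    phase = [cmath.phase(c) for c in spectrum]    # p, computed via atan2(imag, real)
    new_amplitude = [sigma * a + mu for a in amplitude]   # a_new = sigma * a + mu
    recomposed = [cmath.rect(a, p) for a, p in zip(new_amplitude, phase)]
    # The input is real, so only the real part of the inverse transform is kept.
    return [v.real for v in idft(recomposed)]


signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
unchanged = restyle(signal, sigma=1.0, mu=0.0)   # identity: amplitude untouched
doubled = restyle(signal, sigma=2.0, mu=0.0)     # every frequency amplified 2x
```

With `sigma=1` and `mu=0` the signal is reproduced, and because the phase is never modified, changing `sigma` or `mu` alters only how strongly each frequency contributes. This is the sense in which the summary says the amplitude carries "style" while the phase preserves content.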
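
The CCA attention formula above can likewise be sketched in a few lines. This is an illustrative example under stated assumptions, not the authors' code: it reads `X_z^v` as a list of visual tokens and `X_t` as a list of text-category embeddings (as the notation suggests), and it models the projections `phi_q`, `phi_k`, `phi_v` as plain matrix multiplications with caller-supplied weights `w_q`, `w_k`, `w_v`. All function and parameter names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, text, w_q, w_k, w_v):
    """Attend from visual tokens (queries) to text tokens (keys and values)."""
    q = matmul(visual, w_q)                     # Q_z = phi_q(X_z^v), assumed linear
    k = matmul(text, w_k)                       # K = phi_k(X_t), assumed linear
    v = matmul(text, w_v)                       # V = phi_v(X_t), assumed linear
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)            # Q_z K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights          # softmax(Q_z K^T / sqrt(d_k)) V


identity = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn_weights = cross_attention(
    visual_tokens, text_tokens, identity, identity, identity
)
```

Each row of `attn_weights` is a probability distribution over the text tokens, so every visual token is rewritten as a weighted mix of text embeddings. In this toy setup the first visual token puts more weight on the text tokens it aligns with than on the orthogonal one.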

Generalization Boosted Adapter for Open-Vocabulary Segmentation
