Abstract: Despite extensive research, open-vocabulary segmentation methods still struggle to generalize across diverse domains. To reduce the computational cost of adapting Vision-Language Models (VLMs) while preserving their pre-trained knowledge, most methods freeze the VLMs for mask classification and train only the mask generator. However, our comprehensive analysis reveals a surprising insight: open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation. This discovery prompts us to rethink the existing paradigm and explore an alternative approach. Instead of freezing the VLM, we propose to freeze the pre-trained mask generator and focus on optimizing the mask classifier. Building on the observation that VLMs pre-trained on global-pooled image-text features often fail to capture the fine-grained semantics necessary for effective mask classification, we propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation. FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process. As our method strategically optimizes only a small portion of the VLM's parameters, it enjoys the efficiency of adapting to new data distributions while largely preserving the valuable VLM pre-trained knowledge. Extensive ablation studies confirm the superiority of our approach. Notably, FISA achieves new state-of-the-art results across multiple representative benchmarks, improving performance by up to +1.0 PQ and +3.0 mIoU and reducing training costs by nearly 5x compared to previous best methods. Our code and data will be made public.

What problem does this paper attempt to address?

The problem this paper addresses is **the limited ability of open-vocabulary segmentation methods to generalize across diverse domains**. Although existing methods have made progress on this task, they still face two challenges:

1. **High computational cost**: Existing open-vocabulary segmentation methods usually require substantial computational resources for training.
2. **Performance bottleneck**: The performance of existing methods is limited mainly by **mask classification** rather than **mask generation**. Specifically, Vision-Language Models (VLMs) pre-trained on global-pooled image-text features struggle to capture fine-grained semantic information in dense segmentation tasks, which leads to poor classification performance.

### Main contributions of the paper

To solve the above problems, the authors propose a new method named **Fine-grained Semantic Adaptation (FISA)**. The main innovations of FISA are:

1. **Rethinking the existing paradigm**: Unlike most existing methods, which freeze the VLM and train the mask generator, FISA freezes the pre-trained mask generator and focuses on optimizing the VLM-based mask classifier.
2. **Introducing fine-grained semantic adaptation**: To improve mask classification, FISA enriches the extracted visual features with fine-grained semantic information through two key mechanisms:
   - **Semantic-guided Visual Encoding (SEVE)**: Injects fine-grained semantic information early in the visual feature extraction process.
   - **Strategic Image-Mask Optimization (SIMO)**: Optimizes only a small part of the VLM's parameters, preserving its pre-trained knowledge while improving its adaptability to new data distributions.

### Experimental results

The experimental results show that FISA not only achieves new state-of-the-art performance on multiple benchmark datasets but also significantly reduces the training cost. Specifically:

- On datasets such as ADE150 and Mapillary Vistas, FISA improves performance by up to +1.0 PQ and +3.0 mIoU.
- Compared with the previous best methods, the training cost of FISA is reduced by nearly 5 times.

### Formula representation

The implementation of SEVE is described by the following formulas:

\[ \text{SEVE}([MASK], [IMG], [TGT]) = \sigma\left(\hat{q}_{\text{mask}} k_{\text{img}}^T + M_{\text{mask}}\right) \cdot v_{\text{img}} \]

where

\[ \hat{q}_{\text{mask}},\ k_{\text{img}},\ v_{\text{img}} = f_q(\widehat{[MASK]}),\ f_k([IMG]),\ f_v([IMG]) \]

\[ \widehat{[MASK]} = \sigma\left(q_{\text{mask}} k_{\text{tgt}}^T\right) \cdot v_{\text{tgt}} \]

\[ q_{\text{mask}},\ k_{\text{tgt}},\ v_{\text{tgt}} = g_q([MASK]),\ g_k([TGT]),\ g_v([TGT]) \]

\[ M_{\text{mask}}(i, j) = \begin{cases} 0, & \text{if mask } i \text{ contains any pixel of patch } j,\\ -\infty, & \text{otherwise} \end{cases} \]

Read from the bottom up, these formulas describe SEVE as two attention steps. First, the mask tokens [MASK] attend to the target tokens [TGT], producing the enriched tokens \(\widehat{[MASK]}\). Second, the enriched tokens query the image tokens [IMG], with the additive bias \(M_{\text{mask}}\) restricting each mask to the image patches it overlaps.
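The two attention steps in the SEVE formulas can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that σ is a row-wise softmax, that the projections f and g are plain linear maps (the hypothetical weight matrices in `params`), and that [TGT] is a set of target embeddings of the same width as the other tokens. The function names `mask_bias` and `seve` are made up for illustration. Following the formulas as written, no 1/√d scaling is applied.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; a row that is entirely -inf yields zeros.
    m = np.max(x, axis=axis, keepdims=True)
    m = np.where(np.isfinite(m), m, 0.0)
    e = np.exp(x - m)
    s = e.sum(axis=axis, keepdims=True)
    return e / np.where(s == 0.0, 1.0, s)

def mask_bias(masks, patch):
    # masks: (N, H, W) boolean array; patch: side length of a square patch.
    # Returns M_mask of shape (N, P): 0 where mask i contains any pixel of
    # patch j, -inf otherwise. Assumes H and W are divisible by `patch`.
    n, h, w = masks.shape
    blocks = masks.reshape(n, h // patch, patch, w // patch, patch)
    covers = blocks.any(axis=(2, 4)).reshape(n, -1)
    return np.where(covers, 0.0, -np.inf)

def seve(mask_tok, img_tok, tgt_tok, bias, params):
    # Step 1: mask tokens attend to target tokens (the g projections).
    q = mask_tok @ params["gq"]
    k_tgt = tgt_tok @ params["gk"]
    v_tgt = tgt_tok @ params["gv"]
    mask_hat = softmax(q @ k_tgt.T) @ v_tgt

    # Step 2: enriched mask tokens attend to image tokens (the f projections),
    # restricted by the additive bias M_mask.
    q_hat = mask_hat @ params["fq"]
    k_img = img_tok @ params["fk"]
    v_img = img_tok @ params["fv"]
    return softmax(q_hat @ k_img.T + bias) @ v_img

# Toy usage: a 4x4 image split into four 2x2 patches, two masks, three targets.
rng = np.random.default_rng(0)
d = 8
params = {name: rng.normal(size=(d, d))
          for name in ("gq", "gk", "gv", "fq", "fk", "fv")}
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, 0, 0] = True      # mask 0 touches only patch 0
masks[1, :, 2:] = True     # mask 1 covers the right half (patches 1 and 3)
out = seve(rng.normal(size=(2, d)), rng.normal(size=(4, d)),
           rng.normal(size=(3, d)), mask_bias(masks, 2), params)
```

In the toy example, mask 0 overlaps a single patch, so its softmax row puts all weight on that patch and its output equals that patch's value vector. This makes the role of the bias concrete: each mask can only pool features from the patches it overlaps.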

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

When Masked Image Modeling Meets Source-free Unsupervised Domain Adaptation: Dual-Level Masked Network for Semantic Segmentation

OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation

Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

Global Knowledge Calibration for Fast Open-Vocabulary Segmentation

Generalization Boosted Adapter for Open-Vocabulary Segmentation

Exploring Simple Open-Vocabulary Semantic Segmentation

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

IFSeg: Image-free Semantic Segmentation via Vision-Language Model

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Side Adapter Network for Open-Vocabulary Semantic Segmentation

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation