Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

Yong Xien Chng, Xuchong Qiu, Yizeng Han, Kai Ding, Wan Ding, Gao Huang
2024-12-09
Abstract: Despite extensive research, open-vocabulary segmentation methods still struggle to generalize across diverse domains. To reduce the computational cost of adapting Vision-Language Models (VLMs) while preserving their pre-trained knowledge, most methods freeze the VLMs for mask classification and train only the mask generator. However, our comprehensive analysis reveals a surprising insight: open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation. This discovery prompts us to rethink the existing paradigm and explore an alternative approach. Instead of freezing the VLM, we propose to freeze the pre-trained mask generator and focus on optimizing the mask classifier. Building on the observation that VLMs pre-trained on global-pooled image-text features often fail to capture the fine-grained semantics necessary for effective mask classification, we propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation. FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process. As our method strategically optimizes only a small portion of the VLM's parameters, it enjoys the efficiency of adapting to new data distributions while largely preserving the valuable VLM pre-trained knowledge. Extensive ablation studies confirm the superiority of our approach. Notably, FISA achieves new state-of-the-art results across multiple representative benchmarks, improving performance by up to +1.0 PQ and +3.0 mIoU and reducing training costs by nearly 5x compared to previous best methods. Our code and data will be made public.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is **the insufficient generalization ability of open-vocabulary segmentation methods across diverse domains**. Although existing methods have made progress on this task, they still face the following challenges:

1. **High computational cost**: Existing open-vocabulary segmentation methods usually require a large amount of computational resources for training.
2. **Performance bottleneck**: The performance of existing methods is mainly limited by **mask classification**, rather than **mask generation**. Specifically, when vision-language models (VLMs) handle dense segmentation tasks, they struggle to capture fine-grained semantic information, resulting in poor classification performance.

### Main contributions of the paper

To solve the above problems, the authors propose a new framework named **Fine-grained Semantic Adaptation (FISA)**. The main innovations of FISA include:

1. **Rethinking the existing paradigm**: Unlike most existing methods, which freeze the VLM and train the mask generator, FISA freezes the pre-trained mask generator and focuses on optimizing the VLM-based mask classifier.
2. **Introducing fine-grained semantic adaptation**: To improve mask classification, FISA enhances the fine-grained semantic information of the extracted visual features through two key mechanisms:
   - **Semantic-guided Visual Encoding (SEVE)**: Injects fine-grained semantic information early in the visual feature extraction process.
   - **Strategic Image-Mask Optimization (SIMO)**: Optimizes only a small part of the VLM's parameters to maintain its pre-trained knowledge while improving its adaptability to new data distributions.

### Experimental results

The experimental results show that FISA not only achieves new state-of-the-art performance on multiple benchmark datasets but also significantly reduces the training cost.
Specifically:

- On datasets such as ADE150 and Mapillary Vistas, FISA improves performance by up to +1.0 PQ and +3.0 mIoU.
- Compared with the previous best method, the training cost of FISA is reduced by nearly 5 times.

### Formula representation

The specific implementation of SEVE is described by the following formulas:

\[ \text{SEVE}([\text{MASK}], [\text{IMG}], [\text{TGT}]) = \sigma\left(\hat{q}_{\text{mask}} k_{\text{img}}^T + M_{\text{mask}}\right) \cdot v_{\text{img}} \]

where

\[ \hat{q}_{\text{mask}},\ k_{\text{img}},\ v_{\text{img}} = f_q(\widehat{[\text{MASK}]}),\ f_k([\text{IMG}]),\ f_v([\text{IMG}]) \]

\[ \widehat{[\text{MASK}]} = \sigma\left(q_{\text{mask}} k_{\text{tgt}}^T\right) \cdot v_{\text{tgt}} \]

\[ q_{\text{mask}},\ k_{\text{tgt}},\ v_{\text{tgt}} = g_q([\text{MASK}]),\ g_k([\text{TGT}]),\ g_v([\text{TGT}]) \]

\[ M_{\text{mask}}(i, j) = \begin{cases} 0, & \text{if mask } i \text{ contains any pixel of patch } j,\\ -\infty, & \text{otherwise} \end{cases} \]

Read from the last definition upward, the \([\text{MASK}]\) tokens first attend to the target tokens \([\text{TGT}]\), producing the refined tokens \(\widehat{[\text{MASK}]}\). These refined tokens then attend to the image tokens \([\text{IMG}]\), with \(M_{\text{mask}}\) restricting each mask to the patches it overlaps.
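The two attention steps in these formulas can be sketched in NumPy. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions the summary does not confirm: \(\sigma\) is taken to be a softmax over the key axis, each of \(f_q, f_k, f_v, g_q, g_k, g_v\) is modeled as a single bias-free linear projection with random weights, attention is single-head with no scaling factor (none appears in the formulas), and all names (`seve`, `patch_overlap_bias`, `W`) and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Stable softmax; exp(-inf) is 0, so masked entries receive zero weight.
    # Assumes every row has at least one finite entry.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_overlap_bias(masks, patch):
    """M_mask(i, j): 0 if binary mask i covers any pixel of patch j, else -inf."""
    n, h, w = masks.shape
    blocks = masks.reshape(n, h // patch, patch, w // patch, patch)
    overlaps = blocks.any(axis=(2, 4)).reshape(n, -1)  # (n_masks, n_patches)
    return np.where(overlaps, 0.0, -np.inf)

def seve(mask_tok, img_tok, tgt_tok, bias, W):
    # Step 1: [MASK] tokens attend to [TGT] tokens, giving refined mask tokens.
    q_mask = mask_tok @ W["g_q"]
    k_tgt, v_tgt = tgt_tok @ W["g_k"], tgt_tok @ W["g_v"]
    mask_hat = softmax(q_mask @ k_tgt.T) @ v_tgt
    # Step 2: refined mask tokens attend to [IMG] tokens; the additive bias
    # removes every patch that the mask does not overlap.
    q_hat = mask_hat @ W["f_q"]
    k_img, v_img = img_tok @ W["f_k"], img_tok @ W["f_v"]
    attn = softmax(q_hat @ k_img.T + bias)
    return attn @ v_img, attn

# Toy setup (all sizes hypothetical): 4x4 image, 2x2 patches -> 4 patches.
rng = np.random.default_rng(0)
d = 8
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True   # mask 0 covers only patch 0 (top-left)
masks[1, 3, :] = True     # mask 1 covers the bottom row: patches 2 and 3
bias = patch_overlap_bias(masks, patch=2)

# Random stand-ins for the six projections; assumed, not learned.
W = {name: 0.1 * rng.normal(size=(d, d))
     for name in ("g_q", "g_k", "g_v", "f_q", "f_k", "f_v")}
mask_tok = rng.normal(size=(2, d))   # [MASK] tokens
img_tok = rng.normal(size=(4, d))    # [IMG] patch tokens
tgt_tok = rng.normal(size=(3, d))    # [TGT] text tokens

out, attn = seve(mask_tok, img_tok, tgt_tok, bias, W)
```

Because mask 0 overlaps only patch 0, its attention row collapses onto that single patch, and its output equals the projected value of patch 0. Mask 1 spreads its attention over patches 2 and 3 only.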