FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation

Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
2025-01-01
Abstract: Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, VLMs are typically pretrained for image-level vision-text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information, which VLMs alone cannot provide. As a result, information extracted directly from VLMs cannot meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation. The core of FGAseg is a Pixel-Level Alignment module that employs a cross-modal attention mechanism and a text-pixel alignment loss to refine the coarse-grained alignment from CLIP, achieving finer-grained pixel-text semantic alignment. Additionally, to enrich category boundary information, we introduce the alignment matrices as optimizable pseudo-masks during forward propagation and propose a Category Information Supplementation module. These pseudo-masks, derived from cosine and convolutional similarity, provide essential global and local boundary information between different categories. By combining these two strategies, FGAseg effectively enhances pixel-level alignment and category boundary information, addressing key challenges in open-vocabulary segmentation. Extensive experiments demonstrate that FGAseg outperforms existing methods on open-vocabulary semantic segmentation benchmarks.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper addresses key challenges in open-vocabulary segmentation (OVS). The goal of OVS is to classify and segment images at the pixel level according to text descriptions, without being limited to a predefined set of classes. However, existing vision-language models (VLMs), such as CLIP, are usually pre-trained for image-level vision-text alignment, which leaves them deficient for segmentation tasks that require fine-grained pixel-level alignment.

#### Main problems:

1. **Fine-grained pixel-text alignment**:
   - Existing VLMs mainly focus on global semantic features and cannot provide the required fine-grained pixel-level alignment and detailed class boundary information.
   - Information extracted directly from VLMs therefore cannot meet the requirements of segmentation tasks.
2. **Preservation of class boundary information**:
   - Segmentation tasks require not only accurate pixel-level alignment but also rich class boundary information to produce accurate results.
   - Current models often fail to fully preserve class boundary information while achieving fine-grained pixel-level alignment.

#### Solutions:

To solve these problems, the authors propose the FGAseg framework, which enhances pixel-level alignment and class boundary information through two core modules:

1. **Pixel-Level Alignment Module**:
   - **Pixel-Text Alignment Transformer (P2Tformer)**: Uses a cross-attention mechanism to align visual information with text descriptions at a fine-grained level.
   - **Text-Pixel Alignment Loss (T2Ploss)**: Introduces a loss function that guides the visual encoder toward more accurate text-pixel alignment while maintaining the pre-trained image-text alignment.
2. **Category Supplementation Propagation Module**:
   - **Global Category Supplementation (GCS)**: Uses cosine similarity to capture global class boundary information.
   - **Local Category Supplementation (LCS)**: Uses convolutional similarity to capture local detail and further enrich class boundary information.

By combining these two modules, FGAseg improves the quality of both pixel-level alignment and class boundary information, addressing the key challenges in open-vocabulary semantic segmentation.

### Summary

The main contribution of this paper is a new framework, FGAseg. By introducing the pixel-level alignment module and the category supplementation propagation module, it addresses the deficiencies of existing VLMs in open-vocabulary semantic segmentation, especially in fine-grained pixel-text alignment and preservation of class boundary information. Experimental results show that FGAseg performs strongly on multiple commonly used datasets and outperforms existing methods.
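To make the idea of a cosine-similarity alignment matrix used as a pseudo-mask more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the function names (`alignment_matrix`, `pseudo_mask`, `predict_labels`), the softmax step, the `temperature` value, and the tiny hand-written embeddings are all assumptions made for illustration. It only shows how per-pixel features can be compared against per-class text embeddings to obtain a soft mask per class.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero

def alignment_matrix(pixel_feats, text_feats):
    """Return a [num_pixels][num_classes] matrix of pixel-text cosine similarities."""
    return [[cosine_similarity(p, t) for t in text_feats] for p in pixel_feats]

def pseudo_mask(alignment, temperature=0.1):
    """Turn each pixel's similarities into a distribution over classes.

    The softmax and the temperature are illustrative assumptions,
    not values taken from the paper.
    """
    masks = []
    for row in alignment:
        scaled = [s / temperature for s in row]
        peak = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scaled]
        total = sum(exps)
        masks.append([e / total for e in exps])
    return masks

def predict_labels(masks):
    """Pick the most likely class index for each pixel."""
    return [max(range(len(row)), key=row.__getitem__) for row in masks]

# Toy data (hypothetical): 4 pixels and 2 classes with 3-dimensional embeddings.
pixel_feats = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.8, 0.3],
]
text_feats = [
    [1.0, 0.0, 0.0],  # embedding standing in for class 0
    [0.0, 1.0, 0.0],  # embedding standing in for class 1
]

alignment = alignment_matrix(pixel_feats, text_feats)
masks = pseudo_mask(alignment)
labels = predict_labels(masks)
print(labels)
```

In a real model the pixel features would come from the visual encoder and the text features from the text encoder, and the similarity would be computed with batched tensor operations rather than Python loops. A local variant in the spirit of LCS would compare neighborhoods of pixels instead of single pixels, which this sketch does not attempt.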