Abstract: Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, VLMs are typically pretrained for image-level vision-text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information, which VLMs alone cannot provide. As a result, information extracted directly from VLMs cannot meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation. The core of FGAseg is a Pixel-Level Alignment module that employs a cross-modal attention mechanism and a text-pixel alignment loss to refine the coarse-grained alignment from CLIP, achieving finer-grained pixel-text semantic alignment. Additionally, to enrich category boundary information, we introduce the alignment matrices as optimizable pseudo-masks during forward propagation and propose a Category Information Supplementation module. These pseudo-masks, derived from cosine and convolutional similarity, provide essential global and local boundary information between different categories. By combining these two strategies, FGAseg effectively enhances pixel-level alignment and category boundary information, addressing key challenges in open-vocabulary segmentation. Extensive experiments demonstrate that FGAseg outperforms existing methods on open-vocabulary semantic segmentation benchmarks.

What problem does this paper attempt to address?

This paper addresses the key challenges in open-vocabulary segmentation (OVS). The goal of OVS is to classify and segment an image at the pixel level according to text descriptions, without being limited to a predefined set of classes. However, existing vision-language models (VLMs), such as CLIP, are usually pre-trained for image-level visual-text alignment, which leaves them poorly suited to segmentation tasks that require fine-grained pixel-level alignment.

#### Main problems:

1. **Fine-grained pixel-text alignment**:
   - Existing VLMs mainly focus on global semantic features and cannot provide the required fine-grained pixel-level alignment and detailed class boundary information.
   - The information directly extracted from VLMs cannot meet the requirements of segmentation tasks.
2. **Preservation of class boundary information**:
   - Segmentation tasks require not only accurate pixel-level alignment but also rich class boundary information to ensure accurate segmentation results.
   - Current models often fail to fully preserve class boundary information while achieving fine-grained pixel-level alignment.

#### Solutions:

To solve the above problems, the authors propose the FGAseg framework, which enhances pixel-level alignment and class boundary information through the following two core modules:

1. **Pixel-Level Alignment Module**:
   - **Pixel-Text Alignment Transformer (P2Tformer)**: Uses a cross-attention mechanism to align visual information with text descriptions at a fine-grained level.
   - **Text-Pixel Alignment Loss (T2Ploss)**: Introduces a loss function that guides the visual encoder toward more accurate text-pixel alignment while maintaining the pre-trained image-text alignment.
2. **Category Supplementation Propagation Module**:
   - **Global Category Supplementation (GCS)**: Uses cosine similarity to capture global class boundary information.
   - **Local Category Supplementation (LCS)**: Captures local detail through convolutional similarity to further enrich class boundary information.

By combining these two modules, FGAseg improves both the quality of pixel-level alignment and the class boundary information, thus addressing the key challenges in open-vocabulary semantic segmentation.

### Summary

The main contribution of this paper is a new framework, FGAseg. By introducing the pixel-level alignment module and the category supplementation propagation module, it addresses the deficiencies of existing VLMs in open-vocabulary semantic segmentation, especially in fine-grained pixel-text alignment and the preservation of class boundary information. Experimental results show that FGAseg outperforms existing methods on multiple commonly used datasets.
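
To make the idea of a cosine-similarity alignment matrix used as a pseudo-mask more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`l2_normalize`, `cosine_alignment`, `pseudo_masks`), the softmax temperature of 0.1, and the toy two-dimensional features are all assumptions made for illustration. A real system would operate on CLIP feature tensors in a deep learning framework, and the convolutional (local) similarity is not shown here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_alignment(pixel_feats, text_feats):
    """Return a (num_pixels x num_classes) matrix of cosine similarities."""
    pixels = [l2_normalize(p) for p in pixel_feats]
    texts = [l2_normalize(t) for t in text_feats]
    return [[sum(a * b for a, b in zip(p, t)) for t in texts] for p in pixels]


def softmax(row, temperature):
    """Numerically stable softmax over one row of similarities."""
    peak = max(row)
    exps = [math.exp((x - peak) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_masks(pixel_feats, text_feats, temperature=0.1):
    """Turn the alignment matrix into soft per-pixel class scores.

    The temperature value is an illustrative assumption, not a value
    taken from the paper.
    """
    sim = cosine_alignment(pixel_feats, text_feats)
    return [softmax(row, temperature) for row in sim]


# Toy example: a 2x2 image flattened to 4 pixels, 2 text classes.
pixel_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]

masks = pseudo_masks(pixel_feats, text_feats)
# Hard label per pixel: index of the class with the highest score.
labels = [row.index(max(row)) for row in masks]
print(labels)
```

In this toy setup, each row of `masks` is a soft class distribution for one pixel, and taking the per-pixel argmax gives a hard segmentation. Treating the soft matrix as a differentiable quantity is one plausible reading of "optimizable pseudo-masks".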

FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation
