Abstract: Visual-textual correlations in the attention maps derived from text-to-image diffusion models have proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises from the input distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically employed in semantic segmentation, which hinders the diffusion models from capturing accurate visual-textual correlations. To solve this, we propose InvSeg, a test-time prompt inversion method that tackles open-vocabulary semantic segmentation by inverting image-specific visual context into text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so as to associate each class with a structure-consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information, softly selecting anchors for each class and calculating weighted distances to pull intra-class pixels closer while separating inter-class pixels, thereby ensuring mask distinction and internal consistency. By incorporating sample-specific context, InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state-of-the-art performance on the PASCAL VOC and Context datasets. Project page: https://jylin8100.github.io/InvSegProject/

What problem does this paper attempt to address?

The paper addresses inaccurate visual-text association in Open-Vocabulary Semantic Segmentation (OVSS) caused by a mismatch in the distribution of input text. Specifically:

1. **Difference in input text distribution**: Diffusion models are usually prompted with content-rich sentences for image generation, whereas semantic segmentation tasks usually supply only isolated class names. This difference makes it difficult for diffusion models to capture accurate visual-text associations.
2. **Lack of sample-specific context**: Most existing methods use isolated class names as text prompts and ignore image-specific context information, which limits model performance.

To solve these problems, the authors propose **InvSeg**, a test-time prompt inversion method. It inverts the image-specific visual context into the text-prompt embedding space, optimizing the text prompt and improving segmentation accuracy. InvSeg introduces **Contrastive Soft Clustering (CSC)** to align the predicted masks with the image's structure information, so that each class is associated with a structure-consistent mask.

### Formula Explanation

- **Pixel-to-pixel distance via symmetric KL divergence**:
  \[ S[i, j, k, l] = \text{KL}(A_{\text{self}}[i, j] \,\|\, A_{\text{self}}[k, l]) + \text{KL}(A_{\text{self}}[k, l] \,\|\, A_{\text{self}}[i, j]) \]
  where \(A_{\text{self}}\) is the self-attention map.
- **Weighted distance calculation**:
  \[ D((i, j), M_c) = \frac{\sum_{(k, l) \in Q_c} S[i, j, k, l] \cdot M_c[k, l]}{\sum_{(k, l) \in Q_c} M_c[k, l]} \]
  where \(M_c\) is the predicted mask for class \(c\) and the sums run over the pixel set \(Q_c\).
- **Intra-class and inter-class distances**:
  \[ D_{\text{intra}} = \sum_{c = 1}^{C} D(\text{Anchor}_c, M_c) \]
  \[ D_{\text{inter}} = \sum_{c' = 1}^{C - 1} \sum_{c = c' + 1}^{C} D(\text{Anchor}_c, M_{c'}) \]
- **Contrastive soft clustering loss function**:
  \[ L_{\text{Cluster}} = \frac{D_{\text{intra}}}{C} - 2 \cdot \frac{D_{\text{inter}}}{C \cdot (C - 1)} \]
  Minimizing this loss lowers the average intra-class distance and raises the average inter-class distance over the \(C(C-1)/2\) class pairs.
- **Entropy minimization loss function**:
  \[ L_{\text{Etrp}} = -\sum_{c = 1}^{C} M_c \cdot \log M_c \]
- **Total loss function**:
  \[ L = L_{\text{Cluster}} + \alpha \cdot L_{\text{Etrp}} \]

With these losses, InvSeg learns context-rich text prompts at test time, which yields more accurate semantic alignment and state-of-the-art performance on the PASCAL VOC and Context datasets.
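
The formulas above can be sketched in NumPy to make the bookkeeping concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: pixels are flattened to a single index, \(Q_c\) is taken to be all pixels, each anchor is simplified to the single pixel with the highest mask value (the paper selects anchors softly), and the entropy term is summed over pixels. The function names `symmetric_kl`, `weighted_distance`, and `csc_style_loss` are hypothetical.

```python
import numpy as np

EPS = 1e-8  # numerical guard for logs and divisions


def symmetric_kl(attn):
    """S[p, q] = KL(a_p || a_q) + KL(a_q || a_p).

    attn: (N, N) array; row p is pixel p's self-attention distribution.
    """
    a = np.clip(attn, EPS, None)
    a = a / a.sum(axis=1, keepdims=True)
    log_a = np.log(a)
    self_term = (a * log_a).sum(axis=1)   # sum_n a_p[n] * log a_p[n]
    cross = a @ log_a.T                   # cross[p, q] = sum_n a_p[n] * log a_q[n]
    kl = self_term[:, None] - cross       # kl[p, q] = KL(a_p || a_q)
    return kl + kl.T


def weighted_distance(S, masks):
    """dist[p, c] = D(p, M_c), assuming Q_c covers all pixels.

    masks: (C, N) soft masks with values in [0, 1].
    """
    return (S @ masks.T) / (masks.sum(axis=1) + EPS)


def csc_style_loss(attn, masks, alpha=1.0):
    """Return (total, cluster, entropy) losses; requires at least 2 classes."""
    C = masks.shape[0]
    S = symmetric_kl(attn)
    dist = weighted_distance(S, masks)          # (N, C)
    anchors = masks.argmax(axis=1)              # assumption: hard anchor per class
    anchor_dist = dist[anchors]                 # [c, c'] = D(Anchor_c, M_c')
    d_intra = np.trace(anchor_dist)
    d_inter = np.tril(anchor_dist, k=-1).sum()  # class pairs with c > c'
    l_cluster = d_intra / C - 2.0 * d_inter / (C * (C - 1))
    m = np.clip(masks, EPS, 1.0)
    l_entropy = -(m * np.log(m)).sum()          # assumption: summed over pixels
    return l_cluster + alpha * l_entropy, l_cluster, l_entropy


# Toy example: 4 pixels forming two structural groups, {0, 1} and {2, 3}.
attn_toy = np.array([
    [0.4, 0.4, 0.1, 0.1],
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
    [0.1, 0.1, 0.4, 0.4],
])
S_toy = symmetric_kl(attn_toy)

# Masks that follow the structure versus masks that cut across it.
aligned = np.array([[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]])
crossed = np.array([[0.9, 0.1, 0.9, 0.1], [0.1, 0.9, 0.1, 0.9]])

good = csc_style_loss(attn_toy, aligned)
bad = csc_style_loss(attn_toy, crossed)
print("cluster loss, aligned masks:", good[1])
print("cluster loss, crossed masks:", bad[1])
```

In this toy setup the structure-aligned masks receive the lower clustering loss, while the entropy term is identical for both because the mask values are equally sharp. In InvSeg the masks depend on the learnable prompt embeddings, so a differentiable framework would be needed to optimize them; NumPy is used here only to show the arithmetic.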

InvSeg: Test-Time Prompt Inversion for Semantic Segmentation
