Abstract: Visual-textual correlations in the attention maps derived from text-to-image diffusion models are proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises due to the input distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically employed in semantic segmentation, hindering the diffusion models from capturing accurate visual-textual correlations. To solve this, we propose InvSeg, a test-time prompt inversion method that tackles open-vocabulary semantic segmentation by inverting image-specific visual context into text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so as to associate each class with a structure-consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information, softly selecting anchors for each class and calculating weighted distances to push inner-class pixels closer while separating inter-class pixels, thereby ensuring mask distinction and internal consistency. By incorporating sample-specific context, InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state-of-the-art performance on the PASCAL VOC and Context datasets. Project page: https://jylin8100.github.io/InvSegProject/
What problem does this paper attempt to address?
The paper addresses inaccurate visual-text association in Open-Vocabulary Semantic Segmentation (OVSS) caused by a mismatch in input text distribution. Specifically:
1. **Difference in input text distribution**: Diffusion models are typically driven by context-rich sentences during image generation, whereas semantic segmentation usually supplies only isolated class names. This mismatch makes it difficult for diffusion models to capture accurate visual-text associations.
2. **Lack of sample-specific context**: Most existing methods use isolated class names as text prompts and ignore image-specific context, which limits model performance.
To solve these problems, the authors propose **InvSeg**, a test-time prompt inversion method. It inverts the image-specific visual context into the text-prompt embedding space, optimizing the text prompt for each image and improving segmentation accuracy. Specifically, InvSeg introduces **Contrastive Soft Clustering (CSC)** to align the predicted masks with the image's structure information, so that each class is associated with a structure-consistent mask.
### Formula Explanation
- **KL divergence is used to calculate the distance between pixels**:
\[
S[i, j, k, l]=\text{KL}(A_{\text{self}}[i, j] \| A_{\text{self}}[k, l])+\text{KL}(A_{\text{self}}[k, l] \| A_{\text{self}}[i, j])
\]
where \(A_{\text{self}}\) is the self-attention map.
- **Weighted distance calculation**:
\[
D((i, j), M_c)=\frac{\sum_{(k, l) \in Q_c}(S[i, j, k, l] \cdot M_c[k, l])}{\sum_{(k, l) \in Q_c} M_c[k, l]}
\]
- **Intra-class and inter-class distances**:
\[
D_{\text{intra}}=\sum_{c = 1}^{C} D(\text{Anchor}_c, M_c)
\]
\[
D_{\text{inter}}=\sum_{c'= 1}^{C - 1} \sum_{c = c' + 1}^{C} D(\text{Anchor}_c, M_{c'})
\]
- **Contrastive soft clustering loss function**:
\[
L_{\text{Cluster}}=\frac{D_{\text{intra}}}{C}-2 \cdot \frac{D_{\text{inter}}}{C \cdot (C - 1)}
\]
- **Entropy minimization loss function**:
\[
L_{\text{Etrp}}=-\sum_{c = 1}^{C} M_c \cdot \log M_c
\]
- **Total loss function**:
\[
L = L_{\text{Cluster}}+\alpha \cdot L_{\text{Etrp}}
\]
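The formulas above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Pixels are flattened to a single index, \(Q_c\) is taken to be all pixels, and \(D(\text{Anchor}_c, M_{c'})\) is approximated by averaging the per-pixel distance with \(M_c\) as soft anchor weights. The entropy term is averaged over pixels. The function names `symmetric_kl`, `weighted_distance`, and `contrastive_soft_clustering_loss` are hypothetical.

```python
import numpy as np

def symmetric_kl(attn, eps=1e-8):
    """attn: (N, K) array; row p is pixel p's self-attention distribution.
    Returns S of shape (N, N) with S[p, q] = KL(a_p || a_q) + KL(a_q || a_p)."""
    a = attn + eps
    a = a / a.sum(axis=1, keepdims=True)      # renormalize rows after smoothing
    log_a = np.log(a)
    self_term = (a * log_a).sum(axis=1)       # sum_k a[p,k] * log a[p,k]
    cross = a @ log_a.T                       # cross[p,q] = sum_k a[p,k] * log a[q,k]
    kl = self_term[:, None] - cross           # kl[p,q] = KL(a_p || a_q)
    return kl + kl.T

def weighted_distance(S, anchor_w, mask):
    """Soft D(Anchor, M). Assumption: anchors are all pixels, weighted by anchor_w."""
    d_to_mask = (S @ mask) / mask.sum()       # D(p, M) for every pixel p
    return float((anchor_w @ d_to_mask) / anchor_w.sum())

def contrastive_soft_clustering_loss(S, masks, alpha=1.0, eps=1e-8):
    """masks: (C, N) soft masks with C >= 2. Returns (L, L_cluster, L_etrp)."""
    C = masks.shape[0]
    d_intra = sum(weighted_distance(S, masks[c], masks[c]) for c in range(C))
    d_inter = sum(
        weighted_distance(S, masks[c], masks[cp])
        for cp in range(C - 1)
        for c in range(cp + 1, C)
    )
    l_cluster = d_intra / C - 2.0 * d_inter / (C * (C - 1))
    # Entropy summed over classes and averaged over pixels (an assumption).
    l_etrp = float(-(masks * np.log(masks + eps)).sum(axis=0).mean())
    return l_cluster + alpha * l_etrp, l_cluster, l_etrp
```

In this sketch, masks that follow the structure encoded in `S` yield a lower `l_cluster` than masks that cut across it, which is the behaviour the clustering loss is meant to reward.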
With these losses, InvSeg learns context-rich text prompts at test time, achieving more accurate semantic alignment and state-of-the-art performance on the PASCAL VOC and Context datasets.