Abstract: In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from text embeddings of vision-language models. We employ the text embeddings as object queries within a transformer-based segmentation framework (textual object queries). These queries are regarded as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. Additionally, we suggest three regularization losses to improve the efficacy of tqdm by aligning between visual and textual features. By utilizing our method, the model can comprehend inherent semantic information for classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5$\rightarrow$Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses **Domain Generalized Semantic Segmentation (DGSS)**. Specifically, the authors propose using text embeddings from Vision-Language Models (VLMs) as a source of domain-invariant semantic knowledge, in order to improve the model's generalization to unseen domains.

#### Main problem description

1. **Domain generalization challenges**: Traditional methods perform poorly on unseen domains, especially under extreme domain shifts (such as sketch style or ancient Egyptian style). The challenge is to train the model on data from a single source domain only and still have it handle multiple target domains at test time.
2. **Semantic consistency**: Most existing methods focus on learning domain-invariant representations of visual features and neglect high-level semantic concepts. Ensuring that the model understands and recognizes the inherent semantics of each category across domains is a key issue.
3. **Direct use of text information**: Although previous studies have applied VLMs to DGSS, most of them do not use text information directly as the basis for object queries. Designing an object-query mechanism built directly on text embeddings, and thereby achieving better domain generalization, is the focus of this paper.

### Overview of solutions

To solve the above problems, the authors make the following contributions:

1. **Text-driven object queries**: Use the text embeddings of VLMs as textual object queries. These queries carry domain-invariant semantic knowledge and can be used for pixel grouping and segmentation.
2. **Textual Query-Driven Mask Transformer (tqdm)**: Introduce a novel framework, tqdm, that combines textual object queries with three regularization losses to enhance the model's domain generalization ability and semantic clarity.
3. **Experimental verification**: Demonstrate the effectiveness of the proposed method through experiments on multiple benchmark datasets, in particular its superior performance under extreme domain shifts.

### Formula representation

The key formulas in the paper are as follows:

1. **Text embedding generation**:
   \[ t_k = E_T([p, \{\text{class } k\}]) \]
   where $E_T$ is the frozen text encoder, $p$ is the learnable prompt, and $t_k \in \mathbb{R}^C$ is the text embedding of the $k$-th class.
2. **Initial textual object query generation**:
   \[ q_t^0 = \text{MLP}(t) \]
   where $t = \{t_k\}_{k=1}^K \in \mathbb{R}^{K \times C}$ stacks the text embeddings of all $K$ classes, and $q_t^0 \in \mathbb{R}^{K \times D}$ is the initial textual object query.
3. **Text-to-pixel attention mechanism**:
   \[ W = \text{softmax}(Q_z K_t^{\top}) \]
   \[ z \leftarrow z + W V_t \]
   where $Q_z \in \mathbb{R}^{L \times D}$ contains the pixel features used as queries, $K_t, V_t \in \mathbb{R}^{K \times D}$ are the keys and values of the textual clustering centers, and $W \in \mathbb{R}^{L \times K}$ is the attention weight.
4. **Regularization loss**:
   - Language regularization loss: \[ L_{\text{reg}}^L=\text{Cro}
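
The text-to-pixel attention update $W = \text{softmax}(Q_z K_t^{\top})$, $z \leftarrow z + W V_t$ can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the toy values, and the choice to use the pixel features directly as queries (an identity query projection) are assumptions made only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def text_to_pixel_attention(z, q_z, k_t, v_t):
    """Return (updated z, W) for W = softmax(Q_z K_t^T), z <- z + W V_t.

    z, q_z: L x D pixel features and their queries.
    k_t, v_t: K x D keys and values, one row per class.
    """
    # W[i][k]: how strongly pixel i attends to the textual query of class k.
    w = [
        softmax([sum(a * b for a, b in zip(q, key)) for key in k_t])
        for q in q_z
    ]
    # Residual update: add the attention-weighted mix of value vectors.
    z_new = [
        [
            z_i[d] + sum(w_i[k] * v_t[k][d] for k in range(len(v_t)))
            for d in range(len(z_i))
        ]
        for z_i, w_i in zip(z, w)
    ]
    return z_new, w


# Toy example (hypothetical values): L = 2 pixels, K = 2 classes, D = 2 channels.
z = [[1.0, 0.0], [0.0, 1.0]]
k_t = [[5.0, 0.0], [0.0, 5.0]]  # class 0 aligned with channel 0, class 1 with channel 1
v_t = [[1.0, 0.0], [0.0, 1.0]]

# Assumption: pixel features serve directly as queries (identity projection).
z_new, w = text_to_pixel_attention(z, z, k_t, v_t)
```

Each row of `w` is a distribution over the $K$ classes, so every pixel feature is pulled toward the value vector of the class it matches best. This is the sense in which the textual queries act as clustering centers for pixel grouping.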

Textual Query-Driven Mask Transformer for Domain Generalized Segmentation

Learning Content-enhanced Mask Transformer for Domain Generalized Urban-Scene Segmentation

3D Object Segmentation Using Cross-Window Point Transformer with Latent Semantic Boundary Guidance

HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation

SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression

SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation

Position-Guided Point Cloud Panoptic Segmentation Transformer

Dynamic Focus-aware Positional Queries for Semantic Segmentation

Character Queries: A Transformer-based Approach to On-Line Handwritten Character Segmentation

Mask2Former with Improved Query for Semantic Segmentation in Remote-Sensing Images

One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers

DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding

Masked-attention Mask Transformer for Universal Image Segmentation

Pyramid Fusion Transformer for Semantic Segmentation

Mean Shift Mask Transformer for Unseen Object Instance Segmentation

Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization

Maskformer with Improved Encoder-Decoder Module for Semantic Segmentation of Fine-Resolution Remote Sensing Images

MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization

GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Mixed-Query Transformer: A Unified Image Segmentation Architecture

Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation