Textual Query-Driven Mask Transformer for Domain Generalized Segmentation

Byeonghyun Pak, Byeongju Woo, Sunghwan Kim, Dae-hwan Kim, Hoseong Kim
2024-07-31
Abstract: In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from text embeddings of vision-language models. We employ the text embeddings as object queries within a transformer-based segmentation framework (textual object queries). These queries are regarded as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. Additionally, we suggest three regularization losses to improve the efficacy of tqdm by aligning between visual and textual features. By utilizing our method, the model can comprehend inherent semantic information for classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5$\rightarrow$Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses **Domain Generalized Semantic Segmentation (DGSS)**. Specifically, the authors propose using text embeddings from Vision-Language Models (VLMs) as a source of domain-invariant semantic knowledge, in order to improve the model's generalization to unseen domains.

#### Main problem description

1. **Domain generalization challenges**: Traditional methods perform poorly on unseen domains, especially under extreme domain shifts (such as sketch style or ancient Egyptian style). The major challenge is to train the model on data from a single source domain only and still have it generalize to multiple target domains at test time.
2. **Semantic consistency**: Most existing methods focus on learning domain-invariant representations of visual features while ignoring high-level semantic concepts. Ensuring that the model understands and recognizes the inherent semantics of each category across domains is also a key issue.
3. **Direct use of text information**: Although previous studies have applied VLMs to DGSS, most do not use text information directly as the basis for object queries. Designing an object-query mechanism built directly on text embeddings to achieve better domain generalization is therefore the focus of this paper.

### Overview of solutions

To solve the above problems, the authors propose the following:

1. **Text-driven object queries**: Use text embeddings from VLMs as textual object queries. These queries carry domain-invariant semantic knowledge and can be used for pixel grouping and segmentation.
2. **Textual Query-Driven Mask Transformer (tqdm)**: A novel framework, tqdm, combines textual object queries with three regularization losses to enhance the model's domain generalization and semantic clarity.
3. **Experimental verification**: Experiments on multiple benchmark datasets demonstrate the effectiveness of the proposed method, especially under extreme domain shifts.

### Formula representation

The key formulas in the paper are as follows:

1. **Text embedding generation**:
   \[ t_k = E_T([p, \{\text{class } k\}]) \]
   where \( E_T \) is the frozen text encoder, \( p \) is the learnable prompt, and \( t_k\in\mathbb{R}^C \) is the text embedding of the \( k \)-th class.
2. **Initial textual object query generation**:
   \[ q_t^0=\text{MLP}(t) \]
   where \( t = \{t_k\}_{k = 1}^K\in\mathbb{R}^{K\times C} \) stacks the text embeddings of all \( K \) classes, and \( q_t^0\in\mathbb{R}^{K\times D} \) is the initial textual object query.
3. **Text-to-pixel attention mechanism**:
   \[ W=\text{softmax}(Q_zK_t^{\top}) \]
   \[ z\leftarrow z + WV_t \]
   where \( Q_z\in\mathbb{R}^{L\times D} \) are the queries computed from the pixel features \( z \), \( K_t, V_t\in\mathbb{R}^{K\times D} \) are the keys and values of the text clustering centers, and \( W\in\mathbb{R}^{L\times K} \) is the attention weight.
4. **Regularization loss**:
   - Language regularization loss: \[ L_{\text{reg}}^L=\text{Cro}