Enhancing Query Formulation for Universal Image Segmentation

Yipeng Qu, Joohee Kim

DOI: https://doi.org/10.3390/s24061879

IF: 3.9

2024-03-15

Sensors

Abstract: Recent advancements in image segmentation have been notably driven by Vision Transformers. These transformer-based models offer one versatile network structure capable of handling a variety of segmentation tasks. Despite their effectiveness, the pursuit of enhanced capabilities often leads to more intricate architectures and greater computational demands. OneFormer has responded to these challenges by introducing a query-text contrastive learning strategy active during training only. However, this approach has not completely addressed the inefficiency issues in text generation and the contrastive loss computation. To solve these problems, we introduce Efficient Query Optimizer (EQO), an approach that efficiently utilizes multi-modal data to refine query optimization in image segmentation. Our strategy significantly reduces the complexity of parameters and computations by distilling inter-class and inter-task information from an image into a single template sentence. Furthermore, we propose a novel attention-based contrastive loss. It is designed to facilitate a one-to-many matching mechanism in the loss computation, which helps object queries learn more robust representations. Beyond merely reducing complexity, our model demonstrates superior performance compared to OneFormer across all three segmentation tasks using the Swin-T backbone. Our evaluations on the ADE20K dataset reveal that our model outperforms OneFormer in multiple metrics: by 0.2% in mean Intersection over Union (mIoU), 0.6% in Average Precision (AP), and 0.8% in Panoptic Quality (PQ). These results highlight the efficacy of our model in advancing the field of image segmentation.
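The idea of distilling an image's class and task information into a single template sentence can be illustrated with a minimal sketch. The abstract does not give the actual template, so the wording of the sentence, the function name `build_template_sentence`, and the use of per-class counts are all assumptions made for illustration only.

```python
from collections import Counter


def build_template_sentence(class_names, task):
    """Collapse the labels of one image into a single sentence.

    class_names: one class label per annotated object in the image.
    task: the segmentation task name, e.g. "panoptic".
    The sentence template itself is hypothetical, not taken from the paper.
    """
    counts = Counter(class_names)
    # Sort by class name so the same image always yields the same sentence.
    parts = [f"{n} {name}" for name, n in sorted(counts.items())]
    if not parts:
        return f"a photo with no objects for {task} segmentation"
    return f"a photo with {', '.join(parts)} for {task} segmentation"


sentence = build_template_sentence(["car", "person", "car"], "panoptic")
print(sentence)
```

Under this assumption the text encoder processes one sentence per image, whatever the number of objects, where a per-object text list would grow with the object count.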

engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation

What problem does this paper attempt to address?

This paper aims to improve the efficiency and effectiveness of query optimization in universal image segmentation. Specifically, it addresses two problems that arise during training in existing methods:

1. **Redundant Text Generation**: The text lists that existing methods generate contain a large amount of redundant information. This information contributes little to guiding object queries toward objects of different classes, yet it adds parameters and computational cost.
2. **One-to-One Matching in the Contrastive Loss**: The conventional contrastive loss uses one-to-one matching. Because each object query can only be associated with one specific class or object, this limits the ability of object queries to learn more powerful representations.

To solve these problems, the paper proposes the **Efficient Query Optimizer (EQO)**, which improves query optimization in two ways:

- **Efficient Text Generation**: EQO replaces the text list with a single sentence that integrates all semantic cues and retains the necessary inter-class and inter-task information, thereby significantly reducing the number of parameters and the computational complexity.
- **Attention-Based Contrastive Loss**: A new attention-based contrastive loss supports a one-to-many matching mechanism, enabling each object query to learn representations of multiple classes and thereby improving the robustness and performance of the model.

The experimental results show that, on the ADE20K dataset with the Swin-T backbone, the model with EQO outperforms OneFormer by 0.2% in mean Intersection over Union (mIoU), 0.6% in Average Precision (AP), and 0.8% in Panoptic Quality (PQ). This indicates that EQO improves both the efficiency and the accuracy of universal image segmentation.
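One way to picture an attention-based contrastive loss with one-to-many matching is the sketch below. It is not the paper's formulation, which this page does not give. It assumes that each image contributes one sentence embedding and a set of object-query embeddings, that the sentence attends over all queries of an image to form a pooled query vector, and that an InfoNCE-style loss treats the sentence's own image as the positive. All function names and the temperature value are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v, eps=1e-8):
    norm = math.sqrt(dot(v, v)) + eps
    return [x / norm for x in v]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pooled_similarity(text_vec, queries, temperature):
    """Cosine similarity between a sentence and its attention-pooled queries."""
    t = normalize(text_vec)
    qs = [normalize(q) for q in queries]
    # One-to-many: the sentence attends over every query of the image,
    # so several queries can contribute to a single match.
    weights = softmax([dot(t, q) / temperature for q in qs])
    pooled = [sum(w * q[d] for w, q in zip(weights, qs)) for d in range(len(t))]
    return dot(t, normalize(pooled))


def attention_contrastive_loss(batch_queries, batch_text, temperature=0.07):
    """batch_queries: B lists of query vectors; batch_text: B sentence vectors."""
    losses = []
    for b, text_vec in enumerate(batch_text):
        logits = [pooled_similarity(text_vec, qs, temperature) / temperature
                  for qs in batch_queries]
        # InfoNCE: the sentence's own image (index b) is the positive,
        # all other images in the batch are negatives.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum - logits[b])
    return sum(losses) / len(losses)


# Toy batch: three images, two queries each, three-dimensional embeddings.
queries = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
    [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],
]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(attention_contrastive_loss(queries, text))
```

In this toy batch the loss is small when each sentence is paired with its own image and grows when the sentences are shuffled across images, which is the behavior a contrastive objective should show. A one-to-one variant would instead pair each query with exactly one text entry.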

MixFormer: a Mixed CNN-Transformer Backbone for Medical Image Segmentation

Mixed-Query Transformer: A Unified Image Segmentation Architecture

Learning Equivariant Segmentation with Instance-Unique Querying

OneFormer3D: One Transformer for Unified Point Cloud Segmentation

Masked-attention Mask Transformer for Universal Image Segmentation

MP-Former: Mask-Piloted Transformer for Image Segmentation

Mask2Former with Improved Query for Semantic Segmentation in Remote-Sensing Images

Improving Object-centric Learning with Query Optimization

SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression

CQformer: Learning Dynamics Across Slices in Medical Image Segmentation

TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization

CompetitorFormer: Competitor Transformer for 3D Instance Segmentation

EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation

OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework

Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers

CardiacSegFormer: Transformer for Semantic Segmentation of Cardiac Images.

EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting