Abstract: Semantic segmentation of remote sensing (RS) images is vital in various practical applications, including urban construction planning, natural disaster monitoring, and land resources investigation. However, RS images are captured by airplanes or satellites at high altitudes and long distances, so ground objects of the same category are scattered across different parts of the image. Moreover, objects of different sizes appear simultaneously in RS images. For example, in urban scenes some objects occupy a large area, while others cover only small regions. These two common situations pose significant challenges to high-quality segmentation of RS images. Based on these observations, this paper proposes a Mask2Former with an improved query (IQ2Former) for this task. The fundamental motivation behind IQ2Former is to enhance the capability of the Mask2Former query by exploiting the characteristics of RS images. First, we propose the query scenario module (QSM), which learns and groups the queries from feature maps, allowing the selection of distinct scenarios such as urban and rural areas, building clusters, and parking lots. Second, we design the query position module (QPM), which assigns image position information to each query without increasing the number of parameters, thereby enhancing the model's sensitivity to small targets in complex scenarios. Finally, we propose the query attention module (QAM), which leverages the characteristics of query attention to extract valuable features from the preceding queries. Positioned between the duplicated transformer decoder layers, QAM ensures comprehensive use of the supervisory information and exploitation of fine-grained details. Architecturally, the QSM, QPM, and QAM are assembled into an end-to-end model to achieve high-quality semantic segmentation. Compared with classical and state-of-the-art models (FCN, PSPNet, DeepLabV3+, OCRNet, UPerNet, MaskFormer, Mask2Former), IQ2Former demonstrates exceptional performance on three challenging public remote sensing image datasets: 83.59 mIoU on the Vaihingen dataset, 87.89 mIoU on the Potsdam dataset, and 56.31 mIoU on the LoveDA dataset. Additionally, overall accuracy, ablation experiments, and visualized segmentation results all support the validity of IQ2Former.
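For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean intersection over union (mIoU) is commonly computed from flattened label maps. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the paper's exact evaluation protocol (for example, which classes are ignored on each dataset) is not stated in the abstract. The reported scores appear to be percentages, so the value returned here would be multiplied by 100 for comparison.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth.

    `pred` and `target` are flattened per-pixel class indices in [0, num_classes).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # pixels where both agree on class c
    pred_count = [0] * num_classes    # pixels predicted as class c
    target_count = [0] * num_classes  # pixels labelled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: skip classes absent from both maps
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with two classes and four pixels.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Because the score is averaged per class rather than per pixel, a small class such as cars weighs as much as a large class such as road surfaces, which is why small-object errors affect mIoU noticeably.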

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the semantic segmentation problem of remote sensing (RS) images. Specifically, remote sensing images are captured by aircraft or satellites at high altitudes and long distances, which results in ground objects of the same category being scattered across various corners of the image, and objects of different sizes appearing simultaneously in the image. For example, in urban scenes, some objects occupy large areas, while others occupy only small regions. These characteristics pose significant technical challenges for high-quality semantic segmentation.

### Main Technical Challenges

1. **Object Scattering**: Ground objects of the same category are scattered across various corners of the image.
2. **Multi-scale Objects**: Objects of different sizes appear simultaneously in the image, such as road surfaces usually occupying large areas, while cars occupy only small spaces.

### Solution

To address the above issues, the authors propose an improved Mask2Former model, called IQ2Former (Improved Query for Mask2Former). This model enhances query capabilities through the following three modules, thereby better capturing the characteristics of remote sensing images:

1. **Query Scenario Module (QSM)**:
   - Objective: Learn and group queries to select different scenarios (e.g., urban and rural areas, building complexes, and parking lots).
   - Implementation: Convert feature maps into vectors through global average pooling (GAP), then generate selection weights through linear layers and the SoftMax function, and finally combine sub-queries with weighted sums.
2. **Query Position Module (QPM)**:
   - Objective: Assign image position information to each query without increasing the number of parameters, improving the model's sensitivity to small targets in complex scenes.
   - Implementation: Combine the position encoding of input image features with queries to enhance the model's position awareness.
3. **Query Attention Module (QAM)**:
   - Objective: Utilize query attention to extract valuable features, ensuring comprehensive use of supervision information and mining fine-grained details.
   - Implementation: Introduce an attention module between repeated transformer decoder layers to enhance the extraction capability of query features.

### Performance Evaluation

IQ2Former was evaluated on three public remote sensing image datasets, including the Vaihingen dataset, the Potsdam dataset, and the LoveDA dataset. Experimental results show that IQ2Former performs excellently on these datasets, achieving 83.59 mIoU, 87.89 mIoU, and 56.31 mIoU, respectively. Additionally, overall accuracy, ablation experiments, and visualized segmentation results all validate the effectiveness of IQ2Former.

### Conclusion

By improving the query mechanism of Mask2Former, this paper proposes the IQ2Former model, effectively addressing the issues of object scattering and multi-scale objects in remote sensing image semantic segmentation, significantly enhancing segmentation performance.
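The QSM implementation summarized above (global average pooling, a linear layer, SoftMax, then a weighted sum of sub-queries) can be sketched roughly as follows. This is a dependency-free illustration, not the authors' code: the names `select_queries`, `weight`, `bias`, and `sub_queries`, the tensor shapes, and the use of a single scalar weight per sub-query group are all assumptions, and a real model would use a tensor library with learned parameters.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def global_average_pool(feature_map: List[Matrix]) -> Vector:
    """Reduce a C x H x W feature map to a length-C vector of channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_queries(
    feature_map: List[Matrix],  # assumed shape: C x H x W
    weight: Matrix,             # assumed shape: K x C (hypothetical linear layer)
    bias: Vector,               # assumed shape: K
    sub_queries: List[Matrix],  # assumed shape: K groups, each N x D
) -> Tuple[Matrix, Vector]:
    """Mix K groups of sub-queries with image-dependent selection weights."""
    pooled = global_average_pool(feature_map)
    # Linear layer: one logit per sub-query group.
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]
    alphas = softmax(logits)  # selection weights, sum to 1
    num_queries = len(sub_queries[0])
    dim = len(sub_queries[0][0])
    # Weighted sum over the K groups, computed per query and per channel.
    mixed = [
        [
            sum(a * group[n][d] for a, group in zip(alphas, sub_queries))
            for d in range(dim)
        ]
        for n in range(num_queries)
    ]
    return mixed, alphas


if __name__ == "__main__":
    fmap = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]  # C=2, H=W=2
    groups = [[[1.0, 2.0]], [[3.0, 4.0]]]                        # K=2, N=1, D=2
    queries, alphas = select_queries(fmap, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], groups)
    print(alphas, queries)
```

Under these assumptions the selection weights depend only on the pooled image features, so two images from different scenarios (say, dense urban versus rural) would receive different mixtures of the same sub-query groups.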

Mask2Former with Improved Query for Semantic Segmentation in Remote-Sensing Images
