Abstract: The advent of high-resolution multispectral/hyperspectral sensors, LiDAR DSM (Digital Surface Model) information, and many other sources has provided an unprecedented wealth of data for Earth Observation. Multimodal AI seeks to exploit these complementary data sources, particularly for complex tasks like semantic segmentation. While specialized architectures have been developed, they are highly complex, demand significant model-design effort, and require considerable re-engineering whenever a new modality emerges. Recent general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance across multiple multimodal tasks with one unified architecture. In this work, we investigate the performance of PerceiverIO, a member of the general-purpose multimodal family, in the remote sensing semantic segmentation domain. Our experiments reveal that this ostensibly universal network struggles with object scale variation in remote sensing images and fails to detect the presence of cars from a top-down view. To address these issues, even under extreme class imbalance, we propose a spatial and volumetric learning component. Specifically, we design a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously, while reducing network computational burden via the cross-attention mechanism of PerceiverIO. The effectiveness of the proposed component is validated through extensive experiments comparing it with other methods such as 2D convolution and a dual local module (i.e., the combination of Conv2D 1x1 and Conv2D 3x3 inspired by UNetFormer). The proposed method achieves results competitive with specialized architectures like UNetFormer and SwinUNet, showing its potential to minimize network architecture engineering with minimal compromise on performance.
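
The abstract credits the cross-attention mechanism of PerceiverIO with reducing the computational burden. The general idea behind that kind of latent cross-attention can be sketched in NumPy as below. This is only an illustration under stated assumptions: a single attention head, random weights, and hypothetical sizes (`M`, `C`, `N`, `d` are not taken from the paper). It is not the authors' implementation.

```python
import numpy as np

def cross_attention(latents, inputs, w_q, w_k, w_v):
    """Single-head cross-attention: a short latent array queries a long input array."""
    q = latents @ w_q                                # (N, d) queries from the latents
    k = inputs @ w_k                                 # (M, d) keys from the inputs
    v = inputs @ w_v                                 # (M, d) values from the inputs
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (N, M) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)     # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each latent's weights sum to 1
    return weights @ v, weights                      # (N, d) output, (N, M) attention map

rng = np.random.default_rng(0)
M, C = 64 * 64, 16    # hypothetical: a 64x64 tile flattened to 4096 pixels, 16 features each
N, d = 128, 32        # hypothetical: 128 latent vectors of width 32

inputs = rng.normal(size=(M, C))
latents = rng.normal(size=(N, d))
w_q = rng.normal(size=(d, d)) * 0.1
w_k = rng.normal(size=(C, d)) * 0.1
w_v = rng.normal(size=(C, d)) * 0.1

out, weights = cross_attention(latents, inputs, w_q, w_k, w_v)

# Number of attention scores: latent cross-attention vs. full self-attention over pixels.
cross_cost = N * M
self_cost = M * M
```

With these sizes the score matrix has `N * M` entries rather than the `M * M` that self-attention over all pixels would need, which is where the saving comes from when `N` is much smaller than `M`.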

What problem does this paper attempt to address?

This paper attempts to address the issue that general multimodal transformers (such as PerceiverIO) perform worse than specially designed architectures (such as UNetFormer) in the task of semantic segmentation of high-resolution remote sensing images. Specifically, the paper points out that PerceiverIO has deficiencies in detecting small objects (such as cars) and in the fusion of data from different modalities.

### Main Issues:

1. **Small Object Detection**: PerceiverIO performs poorly in detecting small objects (such as cars). Specifically, in the Vaihingen and Potsdam datasets, PerceiverIO fails to detect cars.
2. **Multimodal Data Fusion**: PerceiverIO fails to effectively fuse data from different modalities (such as RGB, DSM, SAR, etc.), leading to a decline in performance.

### Solutions:

To overcome these issues, the authors propose the following two main contributions:

1. **Contribution 1**: Introduce a convolution-based preprocessing component to help detect small objects. By adding an additional 2D convolutional layer before the input data enters the cross-attention head of PerceiverIO, the detection capability for small objects (such as cars) is significantly improved.
2. **Contribution 2**: Propose a volumetric-aware preprocessing component to better utilize the synergy between different modalities. By using 3D convolutional kernels, this module can learn the interactions between different modality data, thereby improving overall performance.

### Experimental Results:

- **Quantitative Results**: Experimental results show that with the proposed preprocessing components, PerceiverIO's performance on the Vaihingen and Potsdam datasets is significantly improved, especially in the detection of the car category.
- **Qualitative Results**: Visualization results indicate that the improved PerceiverIO not only shows improvement in small object detection but also provides more realistic overall predictions, reducing edge misclassification issues.

### Conclusion:

By introducing spatial and volumetric-aware preprocessing components, general multimodal transformers (such as PerceiverIO) exhibit performance comparable to specially designed architectures (such as UNetFormer and SwinUNet) in the task of semantic segmentation of high-resolution remote sensing images. However, there is still room for improvement, particularly in the precise detection of small objects. Future work will explore self-supervised and weakly supervised learning methods to leverage existing sparse data labels.
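
The volumetric idea described under the second contribution, a 3D kernel that spans several modalities at once, can be illustrated with a naive NumPy sketch. Everything specific here is an assumption for illustration: the four co-registered rasters (three spectral bands plus a DSM), the 8x8 tile, the 2x3x3 kernel, and the helper `conv3d_valid` are hypothetical, and the code is not the authors' UNet-inspired module.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of a (D, H, W) volume with a (kd, kh, kw) kernel."""
    D, H, W = volume.shape
    kd, kh, kw = kernel.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                patch = volume[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
H, W = 8, 8

# Hypothetical co-registered rasters: three spectral bands and one DSM layer.
band_a, band_b, band_c, dsm = (rng.random((H, W)) for _ in range(4))

# Stack the modalities along a depth axis so a 3D kernel can slide across them.
volume = np.stack([band_a, band_b, band_c, dsm], axis=0)   # (4, 8, 8)

# A kernel of depth 2 mixes each pair of neighbouring modalities over a 3x3 window.
kernel = rng.normal(size=(2, 3, 3))
features = conv3d_valid(volume, kernel)                    # (3, 6, 6)

# Raise only the DSM layer by a constant and see which output slices react.
perturbed = volume.copy()
perturbed[3] += 1.0
delta = conv3d_valid(perturbed, kernel) - features
```

Only the last output slice, the one whose receptive field covers both `band_c` and `dsm`, changes when the DSM is perturbed. That is the sense in which a 3D kernel learns a joint response across modalities, whereas a 2D kernel applied per modality would keep them separate until a later fusion step.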

General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation

In Defense Of Multi-Source Omni-Supervised Efficient Convnet For Robust Semantic Segmentation In Heterogeneous Unseen Domains

MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images

Semi-Supervised Adversarial Semantic Segmentation Network Using Transformer and Multiscale Convolution for High-Resolution Remote Sensing Imagery

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

Dual-resolution Transformer Combined with Multi-Layer Separable Convolution Fusion Network for Real-Time Semantic Segmentation

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images

A Multi-Attention UNet for Semantic Segmentation in Remote Sensing Images

UNeXt: An Efficient Network for the Semantic Segmentation of High-Resolution Remote Sensing Images

Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

Transformer-Based Cross-Modal Information Fusion Network for Semantic Segmentation

Multi-Attention-Network for Semantic Segmentation of Fine Resolution Remote Sensing Images

Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images

Enhancing Multiscale Representations with Transformer for Remote Sensing Image Semantic Segmentation

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

MCAT-UNet: Convolutional and Cross-Shaped Window Attention Enhanced UNet for Efficient High-Resolution Remote Sensing Image Segmentation

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery