Abstract: The advent of high-resolution multispectral and hyperspectral sensors, LiDAR-derived Digital Surface Models (DSMs), and many other data sources has provided an unprecedented wealth of data for Earth observation. Multimodal AI seeks to exploit these complementary data sources, particularly for complex tasks such as semantic segmentation. Although specialized architectures have been developed, they are highly complex, demand significant model-design effort, and require considerable re-engineering whenever a new modality emerges. Recent general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance across multiple multimodal tasks with a single unified architecture. In this work, we investigate the performance of PerceiverIO, a member of the general-purpose multimodal family, on remote sensing semantic segmentation. Our experiments reveal that this ostensibly universal network struggles with object scale variation in remote sensing images and fails to detect cars from a top-down view. To address these issues, even under extreme class imbalance, we propose a spatial and volumetric learning component. Specifically, we design a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously, while reducing the network's computational burden via the cross-attention mechanism of PerceiverIO. The effectiveness of the proposed component is validated through extensive experiments comparing it with alternatives such as 2D convolution and a dual local module (i.e., the combination of a 1x1 Conv2D and a 3x3 Conv2D, inspired by UNetFormer). The proposed method achieves results competitive with specialized architectures such as UNetFormer and SwinUNet, showing its potential to minimize network architecture engineering with minimal compromise on performance.
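The two mechanisms named in the abstract, a 3D convolution over stacked modalities and a cross-attention step that compresses many pixel tokens into a few latents, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `conv3d_same` and `cross_attention`, the tensor sizes (four spectral bands plus one DSM channel on a 16x16 tile, eight latents), the random kernel, and the absence of learned projections are all assumptions made for illustration only.

```python
import numpy as np


def conv3d_same(volume, kernel):
    """Naive zero-padded 3D cross-correlation of a (D, H, W) volume.

    The depth axis D holds the stacked modalities, so a kernel with
    depth > 1 mixes neighbouring modalities and neighbouring pixels
    in a single operation. The kernel must have odd side lengths.
    """
    kd, kh, kw = kernel.shape
    pd, ph, pw = kd // 2, kh // 2, kw // 2
    padded = np.pad(volume, ((pd, pd), (ph, ph), (pw, pw)))
    depth, height, width = volume.shape
    out = np.zeros_like(volume, dtype=float)
    # Accumulate one shifted copy of the input per kernel element.
    for dz in range(kd):
        for dy in range(kh):
            for dx in range(kw):
                window = padded[dz:dz + depth, dy:dy + height, dx:dx + width]
                out += kernel[dz, dy, dx] * window
    return out


def cross_attention(latents, inputs):
    """Single-head cross-attention with no learned projections.

    latents: (M, C) queries; inputs: (N, C) keys and values.
    The score matrix is (M, N), so the cost grows with M * N
    rather than N * N as in full self-attention over the inputs.
    """
    scores = latents @ inputs.T / np.sqrt(latents.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ inputs


rng = np.random.default_rng(0)

# Hypothetical inputs: 4 spectral bands and 1 DSM channel on a 16x16 tile.
bands = rng.random((4, 16, 16))
dsm = rng.random((1, 16, 16))
volume = np.concatenate([bands, dsm], axis=0)          # (5, 16, 16)

# A random 3x3x3 kernel stands in for a learned filter.
kernel = rng.normal(size=(3, 3, 3))
features = conv3d_same(volume, kernel)                  # (5, 16, 16)

# One token per pixel, one channel per modality slice.
tokens = features.reshape(features.shape[0], -1).T      # (256, 5)

# 8 latents attend to 256 tokens: an 8x256 score matrix, not 256x256.
latents = rng.normal(size=(8, tokens.shape[1]))
summary = cross_attention(latents, tokens)              # (8, 5)

print(features.shape, tokens.shape, summary.shape)
```

In this toy setting the latent array is 32 times smaller than the token sequence, which is the source of the computational saving the abstract attributes to PerceiverIO's cross-attention. A trainable version would replace the random kernel and the projection-free attention with learned layers in a deep learning framework.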

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Unifying Terrain Awareness Through Real-Time Semantic Segmentation

UniVision: A Unified Framework for Vision-Centric 3D Perception

UniTR: A Unified TRansformer-based Framework for Co-object and Multi-modal Saliency Detection

Uni3DETR: Unified 3D Detection Transformer

MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

MRFTrans: Multimodal Representation Fusion Transformer for monocular 3D semantic scene completion

RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception

UniWorld: Autonomous Driving Pre-training via World Models

UniHead: Unifying Multi-Perception for Detection Heads

UniDrive: Towards Universal Driving Perception Across Camera Configurations

UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving

A Unified Framework for 3D Scene Understanding

Unifying Visual Perception by Dispersible Points Learning

UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction

OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation