MMFormer: Multimodal Transformer Using Multiscale Self-Attention for Remote Sensing Image Classification

Bo Zhang, Zuheng Ming, Wei Feng, Yaqian Liu, Liang He, Kaixing Zhao

2023-03-23

Abstract: To benefit from the complementary information between heterogeneous data, we introduce a new Multimodal Transformer (MMFormer) for Remote Sensing (RS) image classification using Hyperspectral Image (HSI) data accompanied by another source of data such as Light Detection and Ranging (LiDAR). Unlike the traditional Vision Transformer (ViT), which lacks the inductive biases of convolutions, MMFormer first uses convolutional layers to tokenize patches from the multimodal HSI and LiDAR data. We then propose a Multi-scale Multi-head Self-Attention (MSMHSA) module to address the compatibility problem that often limits the fusion of HSI, with its high spectral resolution, and LiDAR, with its relatively low spatial resolution. The proposed MSMHSA module can incorporate HSI into LiDAR data in a coarse-to-fine manner, enabling us to learn a fine-grained representation. Extensive experiments on widely used benchmarks (e.g., Trento and MUUFL) demonstrate the effectiveness and superiority of our proposed MMFormer for RS image classification.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to effectively fuse data of different modalities in remote sensing image classification, specifically hyperspectral image (HSI) and Light Detection and Ranging (LiDAR) data. Hyperspectral images provide rich spectral information, but they cannot reliably separate ground covers that share a material, such as roads and rooftops built from the same material. LiDAR data adds elevation information, so it can distinguish objects that have the same spectral signature but different heights. The core question is therefore how to combine the strengths of the two data sources, compensate for their respective limitations, and improve classification accuracy.

To this end, the authors propose a new model named **Multimodal Transformer (MMFormer)**. The model uses convolutional layers to tokenize patches from the multimodal data, and it introduces a **Multi-scale Multi-head Self-Attention (MSMHSA)** module to resolve the compatibility problem that arises when fusing data of different resolutions. The MSMHSA module fuses hyperspectral images with LiDAR data in a coarse-to-fine manner, which lets the model learn finer-grained feature representations. The authors evaluate MMFormer on widely used benchmark datasets (e.g., Trento and MUUFL) and report that it is effective and outperforms the methods it is compared against for remote sensing image classification.
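The two ingredients named above, convolutional tokenization and attention over several scales, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: this page does not describe the internals of MSMHSA, so the sketch assumes one attention head per scale, with queries taken from HSI tokens and keys/values taken from LiDAR tokens that are average-pooled at a coarse, a medium, and a fine scale. The function names (`conv_tokenize`, `avg_pool_tokens`, `multiscale_cross_attention`), the patch sizes, and the random weights are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def conv_tokenize(patch, kernel):
    """Turn an (H, W, C_in) patch into tokens with a valid 2-D convolution.

    kernel has shape (k, k, C_in, D); the result is (N, D) tokens plus the
    token grid size (out_h, out_w), where N = out_h * out_w.
    """
    h, w, _ = patch.shape
    k = kernel.shape[0]
    out_h, out_w = h - k + 1, w - k + 1
    tokens = np.empty((out_h * out_w, kernel.shape[-1]))
    for i in range(out_h):
        for j in range(out_w):
            window = patch[i:i + k, j:j + k, :]
            # Contract the (k, k, C_in) window against the kernel -> (D,)
            tokens[i * out_w + j] = np.tensordot(window, kernel, axes=3)
    return tokens, (out_h, out_w)


def avg_pool_tokens(tokens, grid, s):
    """Average-pool a token grid by factor s (s = 1 leaves it unchanged)."""
    h, w = grid
    d = tokens.shape[-1]
    h2, w2 = h // s, w // s
    g = tokens.reshape(h, w, d)[:h2 * s, :w2 * s]
    g = g.reshape(h2, s, w2, s, d).mean(axis=(1, 3))
    return g.reshape(h2 * w2, d)


def multiscale_cross_attention(q_tokens, kv_tokens, kv_grid, scales, rng):
    """Assumed design: one head per scale, ordered coarse to fine.

    Each head attends from q_tokens to kv_tokens pooled at its own scale;
    head outputs are concatenated along the feature axis.
    """
    _, d = q_tokens.shape
    assert d % len(scales) == 0, "feature size must split evenly across heads"
    head_dim = d // len(scales)
    outputs = []
    for s in scales:
        kv = avg_pool_tokens(kv_tokens, kv_grid, s)
        # Random projections stand in for learned weights in this sketch.
        wq, wk, wv = (rng.standard_normal((d, head_dim)) / np.sqrt(d)
                      for _ in range(3))
        q, k, v = q_tokens @ wq, kv @ wk, kv @ wv
        attn = softmax(q @ k.T / np.sqrt(head_dim))  # rows sum to 1
        outputs.append(attn @ v)
    return np.concatenate(outputs, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hsi_patch = rng.standard_normal((11, 11, 20))   # 20 spectral bands (made up)
    lidar_patch = rng.standard_normal((11, 11, 1))  # single elevation channel
    hsi_tok, _ = conv_tokenize(hsi_patch, rng.standard_normal((3, 3, 20, 12)))
    lidar_tok, grid = conv_tokenize(lidar_patch, rng.standard_normal((3, 3, 1, 12)))
    fused = multiscale_cross_attention(hsi_tok, lidar_tok, grid, (4, 2, 1), rng)
    print(fused.shape)
```

In this sketch the coarse heads see only a handful of pooled LiDAR tokens while the fine head sees the full grid, which is one plausible reading of "coarse-to-fine"; the actual module may arrange scales across layers or stages instead.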

Multiscale 3-D-2-D Mixed CNN and Lightweight Attention-Free Transformer for Hyperspectral and LiDAR Classification

A multimodal hyper-fusion transformer for remote sensing image classification

MASSFormer: Memory-Augmented Spectral-Spatial Transformer for Hyperspectral Image Classification

Multimodal Fusion Transformer for Remote Sensing Image Classification

MHST: Multiscale Head Selection Transformer for Hyperspectral and LiDAR Classification

MHIAIFormer: Multihead Interacted and Adaptive Integrated Transformer With Spatial-Spectral Attention for Hyperspectral Image Classification

Mutually Beneficial Transformer for Multimodal Data Fusion

MSMT-LCL: Multiscale Spatial-Spectral Masked Transformer With Local Contrastive Learning for Hyperspectral Image Classification

MHCFormer: Multiscale Hierarchical Conv-Aided Fourierformer for Hyperspectral Image Classification

Joint Classification of Hyperspectral and LiDAR Data Based on Adaptive Gating Mechanism and Learnable Transformer

State Space Models Meet Transformers for Hyperspectral Image Classification

When Multigranularity Meets Spatial–Spectral Attention: A Hybrid Transformer for Hyperspectral Image Classification

Hyperspectral Image Classification Based on Multibranch Attention Transformer Networks

Hyperspectral Image Classification based on Multi-Scale Convolutional Features and Multi-Attention Mechanisms

Multiscanning-Based RNN–Transformer for Hyperspectral Image Classification

Joint Classification of Hyperspectral Images and LiDAR Data Based on Dual-Branch Transformer

A Lightweight Transformer Network for Hyperspectral Image Classification

MultiScale spectral–spatial convolutional transformer for hyperspectral image classification

Hyperspectral Remote-Sensing Classification Combining Transformer and Multiscale Residual Mechanisms