MMFormer: Multimodal Transformer Using Multiscale Self-Attention for Remote Sensing Image Classification

Bo Zhang, Zuheng Ming, Wei Feng, Yaqian Liu, Liang He, Kaixing Zhao
2023-03-23
Abstract: To benefit the complementary information between heterogeneous data, we introduce a new Multimodal Transformer (MMFormer) for Remote Sensing (RS) image classification using Hyperspectral Image (HSI) accompanied by another source of data such as Light Detection and Ranging (LiDAR). Compared with traditional Vision Transformer (ViT) lacking inductive biases of convolutions, we first introduce convolutional layers to our MMFormer to tokenize patches from multimodal data of HSI and LiDAR. Then we propose a Multi-scale Multi-head Self-Attention (MSMHSA) module to address the problem of compatibility which often limits to fuse HSI with high spectral resolution and LiDAR with relatively low spatial resolution. The proposed MSMHSA module can incorporate HSI to LiDAR data in a coarse-to-fine manner enabling us to learn a fine-grained representation. Extensive experiments on widely used benchmarks (e.g., Trento and MUUFL) demonstrate the effectiveness and superiority of our proposed MMFormer for RS image classification.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to fuse data from different modalities effectively for remote sensing image classification, specifically hyperspectral images (HSI) and Light Detection and Ranging (LiDAR) data. Hyperspectral images provide rich spectral information, but they struggle to separate land-cover classes that share a material, such as a road and a rooftop built from the same material. LiDAR data adds elevation information, so it can distinguish objects that have the same spectral signature but different heights. The core question is therefore how to combine the strengths of the two sources, offset their respective limitations, and improve classification accuracy. To this end, the authors propose a model named **Multimodal Transformer (MMFormer)**. The model uses convolutional layers to tokenize patches from the multimodal data and introduces a **Multi-scale Multi-head Self-Attention (MSMHSA)** module to address the compatibility problem that arises when fusing data of different resolutions. The MSMHSA module fuses HSI with LiDAR data in a coarse-to-fine manner, which allows the model to learn a fine-grained feature representation. The authors also report experiments on widely used benchmark datasets (e.g., Trento and MUUFL) that demonstrate the effectiveness and superiority of MMFormer for remote sensing image classification.
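To make the coarse-to-fine idea concrete, the following is a minimal NumPy sketch of attention between two token sets at several spatial scales. It is not the authors' MSMHSA implementation: the summary above does not specify the module's internals, so everything here is an assumption made for illustration. In particular, the function names (`avg_pool_tokens`, `multiscale_cross_attention`), the choice of HSI tokens as queries and LiDAR tokens as keys and values, the use of average pooling to build coarser scales, the residual update, and the omission of learned projections and multiple heads are all hypothetical simplifications.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool_tokens(tokens, grid, scale):
    # tokens: (grid * grid, dim) array laid out row by row on a square grid.
    # Averages non-overlapping scale x scale windows; grid must be divisible by scale.
    dim = tokens.shape[1]
    g = grid // scale
    windows = tokens.reshape(g, scale, g, scale, dim)
    return windows.mean(axis=(1, 3)).reshape(g * g, dim)


def multiscale_cross_attention(hsi_tokens, lidar_tokens, grid, scales=(4, 2, 1)):
    # Hypothetical single-head fusion without learned weights (assumption):
    # HSI tokens act as queries; LiDAR tokens, pooled from coarse to fine,
    # act as keys and values. Each scale adds a residual update.
    dim = hsi_tokens.shape[1]
    out = hsi_tokens
    for scale in scales:  # largest pooling window first, i.e. coarsest scale first
        kv = avg_pool_tokens(lidar_tokens, grid, scale)
        attn = softmax(out @ kv.T / np.sqrt(dim), axis=-1)
        out = out + attn @ kv
    return out


# Toy inputs: a 4 x 4 grid of 8-dimensional tokens per modality (random stand-ins
# for the convolutionally tokenized patches described in the summary).
rng = np.random.default_rng(0)
hsi = rng.normal(size=(16, 8))
lidar = rng.normal(size=(16, 8))
fused = multiscale_cross_attention(hsi, lidar, grid=4)
print(fused.shape)
```

In this sketch the coarsest scale pools the whole LiDAR grid into a single token, so every HSI token first receives the same global elevation summary; later scales let each HSI token attend to progressively more localized LiDAR tokens. A faithful reimplementation would need the layer definitions from the paper itself.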