Abstract: In recent years, many multimodal methods have been proposed in the field of remote sensing, and most of them achieve good results on land-cover classification. However, multi-scale information is seldom considered in the multimodal fusion process. Moreover, multimodal fusion methods rarely apply attention mechanisms, which results in a weak representation of the fused features. To make better use of multimodal data and reduce the losses caused by fusing different modalities, we propose TRMSF (Transformer and Multi-Scale Fusion), a network for land-cover classification based on the joint classification of HSI (hyperspectral images) and LiDAR (Light Detection and Ranging) images. The network strengthens multimodal information fusion by combining the attention mechanism of the Transformer with multi-scale information to fuse features from different modalities. It consists of three parts: a multi-scale attention enhancement module (MSAE), a multimodality fusion module (MMF), and a multi-output module (MOM). MSAE enhances feature representation by extracting features of HSI at multiple scales, each of which is then fused with the LiDAR features. MMF integrates the data of different modalities through an attention mechanism, thereby reducing the loss caused by fusing data with different modal structures. MOM optimizes the network by controlling different outputs and improves the stability of the results. The experimental results show that the proposed network is effective for multimodal joint classification.
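
The fusion idea summarized in the abstract can be illustrated with a minimal sketch. It is not the TRMSF implementation. It assumes that a single LiDAR feature vector acts as the attention query over HSI tokens at each scale, that no learned projections are used, and that the per-scale outputs are simply averaged. The function names (`cross_attention`, `fuse_multiscale`) and the toy data are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention for one query vector (no learned weights)."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output dimension at a time.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def fuse_multiscale(lidar_feat, hsi_scales):
    """Attend from the LiDAR feature to the HSI tokens of every scale.

    Returns the per-scale fused vectors and their element-wise mean.
    Averaging is an assumption made for this sketch only.
    """
    per_scale = [
        cross_attention(lidar_feat, tokens, tokens) for tokens in hsi_scales
    ]
    n = len(per_scale)
    fused = [sum(vec[i] for vec in per_scale) / n for i in range(len(lidar_feat))]
    return per_scale, fused


# Toy example: a 2-dimensional LiDAR feature and HSI tokens at two scales.
lidar_feat = [1.0, 0.0]
hsi_scales = [
    [[1.0, 0.0]],              # fine scale: a single token
    [[1.0, 0.0], [0.0, 1.0]],  # coarser scale: two tokens
]
per_scale, fused = fuse_multiscale(lidar_feat, hsi_scales)
```

In this sketch each scale yields its own fused vector, loosely mirroring the description of multi-scale HSI features being fused with the LiDAR feature separately. A trained network would add learned query, key, and value projections and a classification head per output.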

Multimodal Fusion with Co-attention Mechanism

Multimodal Fusion Method Based on Self-Attention Mechanism

Multi-Feature Fusion Multi-Modal Sentiment Analysis Model Based on Cross-Attention Mechanism

A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

Dual Low-Rank Multimodal Fusion

Multimodal emotion recognition from facial expression and speech based on feature fusion

Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention

Deep Multimodal Data Fusion

Mutually Beneficial Transformer for Multimodal Data Fusion

Feature Extraction Network with Attention Mechanism for Data Enhancement and Recombination Fusion for Multimodal Sentiment Analysis

Context-Dependent Multimodal Sentiment Analysis Based on a Complex Attention Mechanism

Countering Modal Redundancy and Heterogeneity: A Self-Correcting Multimodal Fusion

MEFusion: Unsupervised Mutual Enhancement for Multimodal Image Fusion

Attention Fusion of Transformer-Based and Scale-Based Method for Hyperspectral and LiDAR Joint Classification

Multi-Modal Representation via Contrastive Learning with Attention Bottleneck Fusion and Attentive Statistics Features

Attention is not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion

Tri-Modalities Fusion for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Based on a Cross-Modal Multihead Attention Mechanism

Learning Joint Multimodal Representation Based On Multi-Fusion Deep Neural Networks

Interpretation on Multi-modal Visual Fusion

Attention Bottlenecks for Multimodal Fusion