Abstract: The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is widely used for land cover classification. However, because the two sensors rely on different imaging mechanisms, HSI and LiDAR data differ markedly in appearance, and their dimensions and feature distributions are highly dissimilar. This makes it challenging to represent and correlate semantic information across the modalities. Current pixel-wise classification methods, which rely on cascaded or attention-based fusion, cannot make effective use of multimodal features. Accurate classification requires extracting and fusing both the shared high-order semantic information and the complementary discriminative information contained in multimodal data. In this article, we propose a cross-modal semantic enhancement network (CMSE) for multimodal semantic information mining and fusion. The CMSE framework extracts image features at multiple scales, using convolution kernels of different sizes to capture more representative local sparse features. To represent high-level semantic features related to land cover, we establish a Gaussian-weighted matrix and semantically transform the spatial and spectral features of the distinct branches. We then build a multilevel residual fusion module to incrementally fuse spectral features from HSI and elevation features from LiDAR. Additionally, we introduce a cross-modal semantically constrained loss to guide multimodal semantic feature alignment. We evaluate our approach on three multimodal remote sensing (RS) datasets: Houston2013, Trento, and MUUFL. The experimental results demonstrate that the proposed CMSE model achieves superior accuracy and robustness compared with other related deep networks.
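The abstract does not specify how the Gaussian-weighted matrix or the cross-modal constraint is computed. As a rough illustration only, the sketch below assumes a Gaussian weight centred on the middle pixel of a square patch, and a cosine-distance penalty between HSI and LiDAR feature vectors that have already been projected to a common dimension. The function names, the `sigma` parameter, and the choice of cosine distance are all hypothetical and are not taken from the paper.

```python
import math


def gaussian_weight_matrix(size, sigma=1.0):
    """Build a size x size matrix of Gaussian weights (assumed form).

    Weights peak at the patch centre and are normalised to sum to 1,
    so pixels near the centre pixel contribute most.
    """
    centre = (size - 1) / 2.0
    raw = [
        [
            math.exp(-((r - centre) ** 2 + (c - centre) ** 2) / (2.0 * sigma ** 2))
            for c in range(size)
        ]
        for r in range(size)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def weighted_pool(patch, weights):
    """Collapse an H x W x C patch (nested lists) into a C-dim vector
    by taking the Gaussian-weighted sum over spatial positions."""
    channels = len(patch[0][0])
    pooled = [0.0] * channels
    for r, row in enumerate(patch):
        for c, pixel in enumerate(row):
            for k in range(channels):
                pooled[k] += weights[r][c] * pixel[k]
    return pooled


def cosine_alignment_loss(hsi_vec, lidar_vec, eps=1e-12):
    """Hypothetical alignment penalty: 1 - cosine similarity.

    Assumes both vectors share the same length, i.e. each branch has
    already been projected into a common feature space.
    """
    dot = sum(a * b for a, b in zip(hsi_vec, lidar_vec))
    norm_h = math.sqrt(sum(a * a for a in hsi_vec))
    norm_l = math.sqrt(sum(b * b for b in lidar_vec))
    return 1.0 - dot / (norm_h * norm_l + eps)


# Usage: pool a 3 x 3 patch from each (toy) branch, then compare them.
weights = gaussian_weight_matrix(3, sigma=1.0)
hsi_patch = [[[1.0, 0.5] for _ in range(3)] for _ in range(3)]
lidar_patch = [[[0.8, 0.4] for _ in range(3)] for _ in range(3)]
loss = cosine_alignment_loss(
    weighted_pool(hsi_patch, weights), weighted_pool(lidar_patch, weights)
)
```

In a trainable network these steps would operate on learned feature maps and the penalty would be added to the classification loss; the pure-Python version here is only meant to make the two ideas concrete.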

CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data