Mutually Beneficial Transformer for Multimodal Data Fusion

Jinping Wang, Xiaojun Tan
DOI: https://doi.org/10.1109/tcsvt.2023.3274545
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Multimodal feature fusion representation, e.g., hyperspectral image and light detection and ranging (HSI-LiDAR) fusion, is an essential topic in fusion perception. However, existing networks tend to rely on mandatory feature stacking or local context fusion strategies between modalities, ignoring the power of globally mutual-guided feature transmission. This paper therefore develops a mutually beneficial transformer method for multimodal data fusion (MBFormer), which consists of the following steps. First, in a spatial constraint-based self-attention (SCS) module, spectralwise attention and a spatialwise convolution are applied to the HSI and LiDAR data individually, and a spatial guide mask generated from LiDAR elevation information is then used as an agent to bridge with the HSI and impose spatial feature constraints. Second, in a channel diversity-based transformer (CDT) module, building on local spectral embedding explorations, an adaptive token-mixer mechanism is applied to the groupwise classification token of the HSI and to the individual LiDAR data for global information connectivity and transitivity. Finally, the selected features are fed into a classification layer to compute the final result. Experimental results show that the proposed MBFormer achieves classification accuracies of 97.76% and 98.62% on the Houston and Trento datasets, respectively, indicating its advantages and competitiveness over the compared state-of-the-art methods.
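The spatial-constraint idea in the abstract (a mask derived from LiDAR elevation guiding HSI features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names, the softmax band weighting used as a stand-in for spectralwise attention, the Gaussian similarity to the center pixel's elevation used as the guide mask, and the `sigma` parameter.

```python
import numpy as np


def spectral_attention(hsi):
    """Assumed stand-in for spectralwise attention.

    Reweights each spectral band by a softmax over the bands'
    spatially averaged responses. hsi has shape (H, W, C).
    """
    band_mean = hsi.mean(axis=(0, 1))              # (C,)
    exp = np.exp(band_mean - band_mean.max())      # numerically stable softmax
    weights = exp / exp.sum()                      # (C,), sums to 1
    return hsi * weights                           # broadcast over H and W


def lidar_guide_mask(lidar, sigma=1.0):
    """Hypothetical spatial guide mask from LiDAR elevation.

    Scores each pixel by how close its elevation is to the center
    pixel's elevation (Gaussian kernel). lidar has shape (H, W).
    """
    h, w = lidar.shape
    center = lidar[h // 2, w // 2]
    return np.exp(-((lidar - center) ** 2) / (2.0 * sigma ** 2))  # values in (0, 1]


def constrain_hsi(hsi, lidar, sigma=1.0):
    """Apply the LiDAR-derived mask to the attended HSI patch."""
    attended = spectral_attention(hsi)
    mask = lidar_guide_mask(lidar, sigma)
    return attended * mask[..., None]              # (H, W, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hsi_patch = rng.random((5, 5, 8))              # toy 5x5 patch with 8 bands
    lidar_patch = rng.random((5, 5))               # toy elevation patch
    fused = constrain_hsi(hsi_patch, lidar_patch)
    print(fused.shape)
```

In this sketch, pixels whose elevation differs strongly from the center pixel contribute less to the HSI features, which is one plausible reading of a "spatial feature constraint"; the paper's actual mask construction may differ.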
engineering, electrical & electronic