Abstract: Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared to convolutional neural networks (CNNs). As a result, many researchers have tried to incorporate ViTs in hyperspectral image (HSI) classification tasks. To achieve satisfactory performance, close to that of CNNs, transformers need fewer parameters. ViTs and other similar transformers use an external classification (CLS) token, which is randomly initialized and often fails to generalize well, whereas other sources of multimodal data, such as light detection and ranging (LiDAR), offer the potential to improve these models by means of a CLS token. In this paper, we introduce a new multimodal fusion transformer (MFT) network which comprises a multihead cross patch attention (mCrossPA) for HSI land-cover classification. Our mCrossPA utilizes other sources of complementary information in addition to the HSI in the transformer encoder to achieve better generalization. The concept of tokenization is used to generate CLS and HSI patch tokens, helping to learn a distinctive representation in a reduced and hierarchical feature space. Extensive experiments are carried out on widely used benchmark datasets, i.e., the University of Houston, Trento, University of Southern Mississippi Gulfpark (MUUFL), and Augsburg. We compare the results of the proposed MFT model with other state-of-the-art transformers, classical CNNs, and conventional classifier models. The superior performance achieved by the proposed model is due to the use of multihead cross patch attention. The source code will be made available publicly at https://github.com/AnkurDeria/MFT.
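
The idea of a CLS token attending to HSI patch tokens can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the CLS token has already been derived from the complementary modality (e.g., LiDAR), omits the learned query/key/value projections, and uses the patch tokens as both keys and values. The function name `cross_patch_attention` and all variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_patch_attention(cls_token, patch_tokens, num_heads):
    """Hypothetical multihead cross patch attention sketch.

    cls_token:    list of floats, length dim (assumed to come from LiDAR).
    patch_tokens: list of HSI patch tokens, each a list of length dim.
    Returns the updated CLS token and the per-head attention weights.
    Assumption: no learned projections; each head uses a slice of the
    embedding, and the patch tokens serve as both keys and values.
    """
    dim = len(cls_token)
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads

    fused = []
    weights_per_head = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        query = cls_token[lo:hi]
        keys = [p[lo:hi] for p in patch_tokens]
        # Scaled dot-product score between the CLS query and each patch.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
            for key in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the patch slices (values equal keys here).
        head_out = [
            sum(w * key[j] for w, key in zip(weights, keys))
            for j in range(head_dim)
        ]
        fused.extend(head_out)
        weights_per_head.append(weights)

    # Residual connection: add the attended HSI information to the CLS token.
    updated_cls = [c + f for c, f in zip(cls_token, fused)]
    return updated_cls, weights_per_head


# Toy example: a 4-dimensional CLS token and three HSI patch tokens.
cls = [1.0, 0.0, 0.0, 1.0]
patches = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
updated, attn = cross_patch_attention(cls, patches, num_heads=2)
```

In this toy setup only the CLS token acts as a query, so the cost per head grows linearly with the number of patches rather than quadratically, which is one common motivation for cross attention of this kind.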

Supervised Multimodal Bitransformers for Classifying Images and Text

TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

Generalization algorithm of multimodal pre-training model based on graph-text self-supervised training

Multimodal Transformer For Multimodal Machine Translation

Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Improving Unimodal Inference with Multimodal Transformers

Multimodal Neurons in Pretrained Text-Only Transformers

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Cross-modal sentiment analysis based on Transformer and image-text collaborative interaction

Multimodal Fusion Transformer for Remote Sensing Image Classification

Transformer-Based Classification Outcome Prediction for Multimodal Stroke Treatment

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Supervised Visual Attention for Simultaneous Multimodal Machine Translation

Multimodal Image Fusion Via Self-Supervised Transformer

Multimodal Transformer for Unaligned Multimodal Language Sequences

Factorized Multimodal Transformer for Multimodal Sequential Learning

Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention

Multimodal Token Fusion for Vision Transformers

Feature Fusion Based on Transformer for Cross-modal Retrieval

Multimodal tweet classification in disaster response systems using transformer-based bidirectional attention model