Abstract: Multimodal sentiment analysis has been studied under the assumption that all modalities are available. However, this strong assumption does not always hold in practice, and most multimodal fusion models may fail when some modalities are missing. Several works have addressed the missing modality problem, but most of them consider only the case of a single missing modality and ignore the more general case of multiple missing modalities. To this end, we propose a Tag-Assisted Transformer Encoder (TATE) network to handle the problem of uncertain missing modalities. Specifically, we design a tag encoding module that covers both the single and multiple missing modality cases, so as to guide the network's attention to the missing modalities. In addition, we adopt a new space projection pattern to align common vectors. A Transformer encoder-decoder network is then used to learn the missing modality features. Finally, the outputs of the Transformer encoder are used for the final sentiment classification. Extensive experiments on the CMU-MOSI and IEMOCAP datasets show that our method achieves significant improvements over several baselines.

What problem does this paper attempt to address?

This paper addresses the problem of missing modalities in multimodal sentiment analysis. Most existing multimodal fusion models assume that all modalities are available during both training and testing, but in practice some modalities are often missing. For example, visual features can be lost because of insufficient camera coverage, acoustic information can be unusable because of excessive environmental noise, and text can be withheld for privacy reasons. Handling missing modalities has therefore become an active research topic.

The paper identifies two challenges. First, although some previous works handle a single missing modality, they usually ignore the case where multiple modalities are missing at the same time; covering every case with these methods requires training a separate model for each missing pattern, which is time-consuming and inconvenient. Second, the missing pattern may be uncertain in practice: for example, one or two modalities may be missing at random.

To solve these problems, the paper proposes a Tag-Assisted Transformer Encoder (TATE) network that learns complementary features across modalities. For the first challenge, a tag encoding module marks which modalities are missing, guiding the network to focus on them. For the second challenge, a Transformer first extracts the features within each modality, and these features are then mapped into a common space through a pairwise projection pattern. A pre-trained full-modality network then supervises the encoded vectors. Finally, the output of the Transformer encoder is passed to the classifier for sentiment prediction.

The main contributions of the paper include:
- Proposing the TATE network to handle multiple missing modalities in multimodal sentiment analysis, and making the code public.
- Designing a tag encoding module that covers both the single and multiple missing modality cases, and adopting a new common space projection module to learn the joint representation.
- Achieving significant improvements over several baseline models on the CMU-MOSI and IEMOCAP datasets, which supports the effectiveness of the model.
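The tag idea described above can be illustrated with a minimal sketch. The summary does not give the exact tag layout used in TATE, so everything here is an assumption made for illustration: the modality order, the function names `encode_tag` and `missing_patterns`, and the convention that one binary digit per modality marks it as missing (1) or present (0). The sketch also enumerates the six cases in which one or two of three modalities are missing, which is the set a single tag-conditioned model would have to cover instead of one model per case.

```python
from itertools import combinations

# Assumed modality order for this sketch; the summary does not specify one.
MODALITIES = ("visual", "acoustic", "text")


def encode_tag(missing):
    """Return one 0/1 digit per modality: 1 = missing, 0 = present (assumed layout)."""
    missing = set(missing)
    unknown = missing - set(MODALITIES)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    if len(missing) >= len(MODALITIES):
        raise ValueError("at least one modality must be present")
    return [1 if m in missing else 0 for m in MODALITIES]


def missing_patterns():
    """Enumerate every case where one or two of the three modalities are missing."""
    patterns = []
    for k in range(1, len(MODALITIES)):
        patterns.extend(combinations(MODALITIES, k))
    return patterns


if __name__ == "__main__":
    # Print each missing pattern next to its hypothetical tag.
    for pattern in missing_patterns():
        print(pattern, encode_tag(pattern))
```

In a model along these lines, the tag would typically be embedded and combined with the modality features so that one network can serve all missing patterns; how TATE does this in detail is not stated in this summary.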

Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities
