Abstract: Sentiment analysis is an important research field that aims to extract and fuse sentiment information from human utterances. Because human sentiment is expressed in diverse ways, analysis based on multiple modalities is usually more accurate than analysis based on a single modality. One effective way to let related modalities complement each other is to perform cross-modality interactions. Recently, Transformer-based frameworks have shown a strong ability to capture long-range dependencies, which has led to several Transformer-based approaches for multimodal processing. However, due to the built-in attention mechanism of Transformers, only two modalities can be engaged at once. As a result, the complementary information flow in these Transformer-based techniques is partial and constrained. To mitigate this, we propose TensorFormer, a tensor-based multimodal Transformer framework that takes all relevant modalities into account for interactions. More precisely, we first construct a tensor from the features extracted from each modality, treating one modality as the target while the remaining modalities serve as the sources. We then generate the corresponding interacted features by calculating source-target attention. This strategy lets all involved modalities interact and generates complementary global information. Experiments on multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of TensorFormer. In addition, we evaluate TensorFormer in a related area, depression detection, where the results reveal significant improvements over other state-of-the-art methods.
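
The source-target attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the tensor is built, so the sketch assumes the source modalities are time-aligned and fuses them with a per-time-step outer product, then applies standard scaled dot-product attention with the target modality as the query. The function name `source_target_attention`, the random projection matrices, and all feature dimensions are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def source_target_attention(target, sources, d_k=16, seed=0):
    """Attend from one target modality to a tensor built from all sources.

    target:  array of shape (T_target, d_target)
    sources: list of arrays, each of shape (T_source, d_i), time-aligned
    Returns (attended features, attention weights).
    """
    rng = np.random.default_rng(seed)

    # Assumption: fuse the sources with a per-time-step outer product,
    # so every source modality contributes to every key and value.
    fused = sources[0]
    for s in sources[1:]:
        fused = np.einsum("tm,tn->tmn", fused, s).reshape(fused.shape[0], -1)

    # Hypothetical random projections standing in for learned weights.
    w_q = rng.standard_normal((target.shape[1], d_k)) / np.sqrt(target.shape[1])
    w_k = rng.standard_normal((fused.shape[1], d_k)) / np.sqrt(fused.shape[1])
    w_v = rng.standard_normal((fused.shape[1], d_k)) / np.sqrt(fused.shape[1])

    q = target @ w_q   # queries come from the target modality
    k = fused @ w_k    # keys come from the fused source tensor
    v = fused @ w_v    # values come from the fused source tensor

    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v, weights


# Toy features: text is the target; audio and video are the sources.
rng = np.random.default_rng(1)
text = rng.standard_normal((5, 8))    # 5 tokens, 8-dim features
audio = rng.standard_normal((7, 4))   # 7 frames, 4-dim features
video = rng.standard_normal((7, 3))   # 7 frames, 3-dim features

attended, weights = source_target_attention(text, [audio, video])
print(attended.shape, weights.shape)
```

In this sketch each text token attends over seven fused audio-video time steps, so both source modalities influence the result in a single attention pass rather than through separate pairwise passes. Repeating the call with audio or video as the target would give the corresponding interacted features for the other modalities.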

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis

A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

SS-Trans (Single-Stream Transformer for Multimodal Sentiment Analysis and Emotion Recognition): The Emotion Whisperer—A Single-Stream Transformer for Multimodal Sentiment Analysis

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

CTHFNet: contrastive translation and hierarchical fusion network for text–video–audio sentiment analysis

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

A Two-Stage Stacked Transformer Framework for Multimodal Sentiment Analysis

Cross-modal Enhancement Network for Multimodal Sentiment Analysis

Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities

Tensor Fusion Network for Multimodal Sentiment Analysis

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Multimodal Sentiment Analysis Based on Transformer and Low-rank Fusion

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Cross-modal sentiment analysis based on Transformer and image-text collaborative interaction

Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis