Abstract: Sentiment analysis is an important research field that aims to extract and fuse sentiment information from human utterances. Because human sentiment is expressed in diverse ways, analysis over multiple modalities is usually more accurate than analysis over a single modality. One effective way to complement information across related modalities is to perform cross-modality interactions. Recently, Transformer-based frameworks have shown a strong ability to capture long-range dependencies, leading to the introduction of several Transformer-based approaches for multimodal processing. However, due to the built-in attention mechanism of the Transformer, only two modalities can be engaged at once. As a result, the complementary information flow in these Transformer-based techniques is partial and constrained. To mitigate this, we propose TensorFormer, a tensor-based multimodal Transformer framework that takes all relevant modalities into account for interactions. More precisely, we first construct a tensor from the features extracted from each modality, treating one modality as the target while the remaining modalities serve as the sources. We then generate the corresponding interacted features by calculating source-target attention. This strategy lets all involved modalities interact and generates complementary global information. Experiments on multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of TensorFormer. In addition, we evaluate TensorFormer in a related area, depression detection, where the results reveal significant improvements over other state-of-the-art methods.
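
The source-target attention idea described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify how the tensor is built or how attention is parameterized, so everything below is an assumption made for illustration: the source modalities are combined by a per-time-step outer product, the projection matrices are random rather than learned, and the function name `source_target_attention` and all shapes are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def source_target_attention(target, sources, d_k=16, seed=0):
    """Attend from one target modality to a joint tensor of all sources.

    target:  array of shape (T_target, d_target)
    sources: list of arrays, each of shape (T_source, d_i), sharing T_source
    Assumption: the joint source tensor is a per-time-step outer product,
    flattened to a vector; the paper may construct it differently.
    """
    rng = np.random.default_rng(seed)

    # Fuse all source modalities into one tensor per time step.
    joint = sources[0]
    for s in sources[1:]:
        steps = joint.shape[0]
        joint = np.einsum("ta,tb->tab", joint, s).reshape(steps, -1)

    # Random projections stand in for learned weights (assumption).
    W_q = rng.standard_normal((target.shape[1], d_k)) / np.sqrt(target.shape[1])
    W_k = rng.standard_normal((joint.shape[1], d_k)) / np.sqrt(joint.shape[1])
    W_v = rng.standard_normal((joint.shape[1], d_k)) / np.sqrt(joint.shape[1])

    Q, K, V = target @ W_q, joint @ W_k, joint @ W_v

    # Scaled dot-product attention: target queries, joint-source keys/values.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

# Toy features: text is the target; audio and video are the sources.
rng = np.random.default_rng(1)
text = rng.standard_normal((5, 8))
audio = rng.standard_normal((7, 4))
video = rng.standard_normal((7, 3))

out, weights = source_target_attention(text, [audio, video])
print(out.shape, weights.shape)
```

Because the keys and values are derived from a tensor that mixes every source modality, each target position attends to all other modalities in a single attention step, rather than to one modality at a time as in pairwise cross-modal attention.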

A Two-Stage Stacked Transformer Framework for Multimodal Sentiment Analysis

TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

SS-Trans (Single-Stream Transformer for Multimodal Sentiment Analysis and Emotion Recognition): The Emotion Whisperer—A Single-Stream Transformer for Multimodal Sentiment Analysis

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Based on Transformer and Low-rank Fusion

Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

A Multimodal Sentiment Analysis Method Integrating Multi-Layer Attention Interaction and Multi-Feature Enhancement

GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

Multilevel Transformer For Multimodal Emotion Recognition

Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

Multimodal transformer augmented fusion for speech emotion recognition

Mutually Beneficial Transformer for Multimodal Data Fusion

A Fine-Grained Modal Label-Based Multi-Stage Network for Multimodal Sentiment Analysis

Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling