Abstract: Sentiment analysis is an important research field that aims to extract and fuse sentiment information from human utterances. Because human sentiment is expressed in diverse ways, analysis over multiple modalities is usually more accurate than analysis over a single modality. One effective way to let related modalities complement each other is to perform cross-modality interactions. Recently, Transformer-based frameworks have shown a strong ability to capture long-range dependencies, which has led to several Transformer-based approaches for multimodal processing. However, due to the built-in attention mechanism of the Transformer, only two modalities can be engaged at once. As a result, the complementary information flow in these Transformer-based techniques is partial and constrained. To mitigate this, we propose TensorFormer, a tensor-based multimodal Transformer framework that takes all relevant modalities into account for interactions. More precisely, we first construct a tensor from the features extracted from each modality, assuming one modality is the target while the remaining tensors serve as the sources. We then generate the corresponding interacted features by calculating source-target attention. This strategy lets all involved modalities interact and generates complementary global information. Experiments on multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of TensorFormer. In addition, we evaluate TensorFormer in a related area, depression detection, and the results reveal significant improvements over other state-of-the-art methods.
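The source-target attention idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the TensorFormer implementation: the abstract does not specify how the tensor is built, so the sketch assumes the source modalities are combined by a broadcast element-wise product into a joint tensor, which is then flattened and used as keys and values, with the target modality supplying the queries. The function name `source_target_attention`, the feature shapes, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def source_target_attention(target, sources):
    """Attend from one target modality to a joint tensor of all sources.

    target:  array of shape (T_target, d)
    sources: list of arrays, each of shape (T_i, d)
    Assumption: sources are fused by a broadcast element-wise product,
    giving a tensor of shape (T_1, ..., T_n, d).
    """
    d = target.shape[-1]
    joint = sources[0]
    for src in sources[1:]:
        # (..., 1, d) * (T_i, d) -> (..., T_i, d)
        joint = joint[..., None, :] * src
    flat = joint.reshape(-1, d)               # (T_1 * ... * T_n, d)
    scores = target @ flat.T / np.sqrt(d)     # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ flat, weights

# Hypothetical feature sequences for three modalities (random data).
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))
audio = rng.normal(size=(4, 8))
video = rng.normal(size=(3, 8))

# Text is the target; audio and video together act as the sources.
fused, weights = source_target_attention(text, [audio, video])
print(fused.shape, weights.shape)
```

In this sketch every text position attends over all audio-video position pairs at once, rather than over one source modality at a time. Repeating the call with audio or video as the target would give the corresponding interacted features for those modalities.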

TETFN: A Text Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis

TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

TeFNA: Text-centered Fusion Network with crossmodal Attention for multimodal sentiment analysis

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

Cross-modal Enhancement Network for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

Tensor Fusion Network for Multimodal Sentiment Analysis

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

Multi-layer cross-modality attention fusion network for multimodal sentiment analysis

Tri-Modalities Fusion for Multimodal Sentiment Analysis

SS-Trans (Single-Stream Transformer for Multimodal Sentiment Analysis and Emotion Recognition): The Emotion Whisperer—A Single-Stream Transformer for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Based on Transformer and Low-rank Fusion

A text guided multi-task learning network for multimodal sentiment analysis