Abstract: Sentiment analysis is an important research field that aims to extract and fuse sentiment information from human utterances. Because human sentiment is expressed in diverse ways, analysis over multiple modalities is usually more accurate than analysis over a single modality. One effective way to let related modalities complement each other is to perform cross-modality interactions. Recently, Transformer-based frameworks have shown a strong ability to capture long-range dependencies, leading to the introduction of several Transformer-based approaches for multimodal processing. However, due to the built-in attention mechanism of the Transformer, only two modalities can be engaged at once. As a result, the complementary information flow in these Transformer-based techniques is partial and constrained. To mitigate this, we propose TensorFormer, a tensor-based multimodal Transformer framework that takes all relevant modalities into account for interactions. More precisely, we first construct a tensor from the features extracted from each modality, treating one modality as the target and the remaining modalities as the sources. We then generate the corresponding interacted features by calculating source-target attention. This strategy lets all involved modalities interact and generates complementary global information. Experiments on multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of TensorFormer. In addition, we evaluate TensorFormer in a related area, depression detection, and the results reveal significant improvements over other state-of-the-art methods.
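
The source-target attention idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the TensorFormer implementation: the abstract does not say how the tensor is built, so the sketch assumes that the source modalities are combined by an element-wise product over every pair of source positions, and it omits learned query/key/value projections and multiple heads. The function name `source_target_attention`, the feature dimension, and the sequence lengths are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def source_target_attention(target, sources):
    """Attend from one target modality to a joint tensor of all sources.

    target:  array of shape (T, d)
    sources: list of arrays, each of shape (S_i, d)
    Assumption (not from the paper): sources are fused by an
    element-wise product over all combinations of source positions.
    """
    d = target.shape[-1]
    joint = sources[0]                      # (S_1, d)
    for src in sources[1:]:
        # Broadcasting builds (S_1, ..., S_k, d) one source at a time.
        joint = joint[..., None, :] * src
    keys = joint.reshape(-1, d)             # (S_1 * ... * S_k, d)
    scores = target @ keys.T / np.sqrt(d)   # scaled dot-product scores
    weights = softmax(scores, axis=-1)      # each target step sums to 1
    interacted = weights @ keys             # (T, d) interacted features
    return interacted, weights

# Toy features: text is the target, audio and video are the sources.
rng = np.random.default_rng(0)
text = rng.standard_normal((5, 8))
audio = rng.standard_normal((4, 8))
video = rng.standard_normal((3, 8))

interacted, weights = source_target_attention(text, [audio, video])
print(interacted.shape, weights.shape)  # (5, 8) (5, 12)
```

In this sketch each text step attends over all 4 x 3 audio-video position pairs at once, which is one way to engage more than two modalities in a single attention step. Rotating which modality plays the target role would give interacted features for every modality.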

TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection
