Abstract: Multimodal sentiment classification is an active research field that aims to extract sentiment information and classify sentiment tendency from sequential multimodal data. Most existing sentiment recognition algorithms explore multimodal fusion schemes and achieve good performance. However, two key challenges remain. The first is to effectively extract inter- and intra-modality features prior to fusion while simultaneously reducing ambiguity. The second is to learn modality-invariant representations that capture the underlying similarities between modalities. In this paper, we present a modality-invariant temporal learning technique and a new gated inter-modality attention mechanism to address these challenges. For the first challenge, our proposed gated inter-modality attention mechanism performs modality interactions and adaptively filters inconsistencies across modalities. We also use parallel structures to learn more comprehensive sentiment information from modality pairs (i.e., acoustic and visual). For the second challenge, we treat each modality as a multivariate Gaussian distribution (considering each timestamp as a single Gaussian distribution) and use the KL divergence to capture implicit temporal distribution-level similarities. These strategies help reduce domain shift between modalities and extract effective sequential modality-invariant representations. We have conducted experiments on several public datasets (i.e., YouTube and MOUD), and the results show that our proposed method outperforms state-of-the-art multimodal sentiment classification methods.
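As an illustration of the distribution-level similarity described in the abstract, the following is a minimal sketch of a per-timestamp KL divergence between two modalities. It assumes that each timestamp is modelled as a Gaussian with diagonal covariance and that the means and variances are already available, for example from a modality encoder. The function names, the diagonal-covariance simplification, and the averaging over timestamps are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def gaussian_kl_per_timestamp(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between diagonal Gaussians, one per timestamp.

    All inputs have shape (T, D): T timestamps, D feature dimensions.
    Variances must be strictly positive. Returns an array of shape (T,).
    """
    mu_p, var_p, mu_q, var_q = (
        np.asarray(a, dtype=float) for a in (mu_p, var_p, mu_q, var_q)
    )
    # Closed-form KL for diagonal Gaussians, summed over feature dimensions.
    per_dim = np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    return 0.5 * np.sum(per_dim, axis=-1)

def temporal_kl_loss(mu_p, var_p, mu_q, var_q):
    """Average the per-timestamp KL over the sequence (an assumed reduction)."""
    return float(np.mean(gaussian_kl_per_timestamp(mu_p, var_p, mu_q, var_q)))

if __name__ == "__main__":
    T, D = 3, 2  # hypothetical sequence length and feature size
    mu_acoustic, var_acoustic = np.zeros((T, D)), np.ones((T, D))
    mu_visual, var_visual = np.ones((T, D)), np.ones((T, D))
    print(temporal_kl_loss(mu_acoustic, var_acoustic, mu_visual, var_visual))
```

Because KL divergence is asymmetric, a symmetric variant (the sum of both directions) is another option when neither modality should act as the reference.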

Learning Joint Multimodal Representation with Adversarial Attention Networks

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Multimodal Adversarially Learned Inference with Factorized Discriminators

Adversarial Multimodal Network for Movie Question Answering

Attention-Based Modality-Gated Networks for Image-Text Sentiment Analysis

Multimodal Learning of Social Image Representation by Exploiting Social Relations

Multimodal Representation Learning by Alternating Unimodal Adaptation

Object-Aware Multimodal Named Entity Recognition in Social Media Posts With Adversarial Learning

Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching

Learning Social Image Embedding with Deep Multimodal Attention Networks

Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion

Human-mouse somatic cell hybrid lines selected for human adenosine kinase: a new selective method.

Multimodal Unified Attention Networks for Vision-and-Language Interactions

Generalizable Multi-Linear Attention Network

Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment

Adversarial Learning-Based Semantic Correlation Representation for Cross-Modal Retrieval

Multimodal Semantic Attention Network for Video Captioning

Learning joint relationship attention network for image captioning

Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval