Abstract: Multimodal sentiment classification is a notable research field that aims to refine sentiment information and classify sentiment tendency from sequential multimodal data. Most existing sentiment recognition algorithms explore multimodal fusion schemes and achieve good performance. However, two key challenges remain. First, it is essential to extract inter- and intra-modality features effectively prior to fusion while also reducing ambiguity. Second, the model must learn modality-invariant representations that capture the underlying similarities between modalities. In this paper, we present a modality-invariant temporal learning technique and a new gated inter-modality attention mechanism to overcome these issues. For the first challenge, the proposed gated inter-modality attention mechanism performs modality interactions and adaptively filters inconsistencies across modalities. We also use parallel structures to learn more comprehensive sentiment information in pairs (i.e., acoustic and visual). To address the second challenge, we treat each modality as a multivariate Gaussian distribution (considering each timestamp as a single Gaussian distribution) and use the KL divergence to capture the implicit temporal distribution-level similarities. These strategies help reduce domain shift between modalities and extract effective sequential modality-invariant representations. We conducted experiments on several public datasets (i.e., YouTube and MOUD), and the results show that the proposed method outperforms state-of-the-art multimodal sentiment categorization methods.
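
The distribution-level similarity described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so the following rests on assumptions: each timestamp is modeled as a Gaussian with diagonal covariance, the two modality sequences are already aligned to the same length, and the per-timestamp KL divergences are simply averaged. The function names `kl_diag_gaussians` and `temporal_kl` are hypothetical and are not taken from the paper.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for two Gaussians with diagonal covariance.

    Assumption: diagonal covariance, so the divergence is a sum over dimensions.
    All variances must be strictly positive.
    """
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def temporal_kl(seq_p, seq_q):
    """Average per-timestamp KL between two aligned modality sequences.

    Each sequence is a list of (mean_vector, variance_vector) pairs,
    one pair per timestamp. Averaging over time is an assumption made
    for illustration, not necessarily the paper's aggregation.
    """
    if len(seq_p) != len(seq_q):
        raise ValueError("sequences must have the same number of timestamps")
    total = sum(
        kl_diag_gaussians(mp, vp, mq, vq)
        for (mp, vp), (mq, vq) in zip(seq_p, seq_q)
    )
    return total / len(seq_p)


# Hypothetical toy data: two timestamps, one feature dimension each.
acoustic = [([0.0], [1.0]), ([0.0], [1.0])]
visual = [([1.0], [1.0]), ([0.0], [1.0])]
divergence = temporal_kl(acoustic, visual)
```

A smaller value of `divergence` indicates that the two modalities have more similar per-timestamp distributions; in a training setup such a term would typically be minimized alongside the classification loss. Note that the KL divergence is asymmetric, so a symmetric variant may be preferable depending on the design.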

Towards Temporal Modelling of Categorical Speech Emotion Recognition

Sequence-to-sequence Modelling for Categorical Speech Emotion Recognition Using Recurrent Neural Network

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Self-attention Transfer Networks for Speech Emotion Recognition

Bayesian Inference Based Temporal Modeling for Naturalistic Affective Expression Classification

Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Temporal Shift Module with Pretrained Representations for Speech Emotion Recognition

Emotion recognition by fusing time synchronous and time asynchronous representations

Speech Emotion Classification with the Combination of Statistic Features and Temporal Features

Multi-resolution modulation-filtered cochleagram feature for LSTM-based dimensional emotion recognition from speech

Efficient Modeling of Long Temporal Contexts for Continuous Emotion Recognition

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition

Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition

Speech Emotion Recognition Based on Temporal-Spatial Learnable Graph Convolutional Neural Network

A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video

Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition