Abstract: The main difficulty in emotion recognition in the wild (EmotiW) is training a model robust enough to handle diverse scenarios and anomalies. The Audio-video Sub-challenge of EmotiW provides short audio-video clips covering several emotion labels, and the task is to determine which label each video belongs to. To improve emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information more accurately in the spatial and temporal dimensions by drawing on two mutually complementary sources: facial images and audio. The framework consists of two parts: a facial image model and an audio model. In the facial image model, three different spatio-temporal neural network architectures are employed to extract features from facial expression images that discriminate between emotions. First, high-level spatial features are obtained from pre-trained convolutional neural networks (CNNs), namely VGG-Face and ResNet-50, both fed with the frames extracted from each video. Then, the features of all frames are fed sequentially into a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variations of facial appearance textures in the video. In addition to this CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. In the audio model, spectrogram images of speech, generated by preprocessing the audio, are likewise modeled in a VGG-BLSTM framework to characterize affective fluctuation more efficiently. Finally, a fusion strategy that combines the score matrices produced by the different spatio-temporal networks is proposed to exploit their complementarity and boost emotion recognition performance.
Extensive experiments show that the proposed MSFF achieves an overall accuracy of 60.64%, a large improvement over the baseline that also outperforms the result of the champion team in 2017.
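The abstract does not specify how the score matrices are combined, so the following is only a minimal sketch of one common form of score-level fusion: a normalized weighted sum of per-model class scores, followed by an argmax per clip. The function names `fuse_scores` and `predict`, the model keys, the three-class toy scores, and the weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def fuse_scores(
    score_matrices: Dict[str, List[List[float]]],
    weights: Dict[str, float],
) -> List[List[float]]:
    """Weighted sum of per-model score matrices (clips x classes).

    Assumption: weights are normalized to sum to 1 before fusing.
    The paper's actual fusion rule is not given in the abstract.
    """
    names = list(score_matrices)
    n_clips = len(score_matrices[names[0]])
    n_classes = len(score_matrices[names[0]][0])
    total = sum(weights[name] for name in names)

    fused = [[0.0] * n_classes for _ in range(n_clips)]
    for name in names:
        matrix = score_matrices[name]
        # Every model must score the same clips over the same classes.
        if len(matrix) != n_clips or any(len(row) != n_classes for row in matrix):
            raise ValueError(f"score matrix for {name!r} has a mismatched shape")
        w = weights[name] / total
        for i, row in enumerate(matrix):
            for j, score in enumerate(row):
                fused[i][j] += w * score
    return fused


def predict(fused: List[List[float]]) -> List[int]:
    """Return the index of the highest fused score for each clip."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Toy example: two clips, three classes, two hypothetical models.
scores = {
    "vgg_blstm": [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    "audio_vgg_blstm": [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
}
weighted = fuse_scores(scores, {"vgg_blstm": 3.0, "audio_vgg_blstm": 1.0})
equal = fuse_scores(scores, {"vgg_blstm": 1.0, "audio_vgg_blstm": 1.0})
weighted_labels = predict(weighted)
equal_labels = predict(equal)
```

In this toy setting the predicted labels change when the visual model is weighted more heavily than the audio model, which illustrates why fusion weights would typically be tuned on a validation set rather than fixed in advance.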

Extracting Method for Fine-Grained Emotional Features in Videos

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities.

Emotion Recognition in Videos via Fusing Multimodal Features.

Deep Spectrum Feature Representations for Speech Emotion Recognition

Exploiting EEG signals and audiovisual feature fusion for video emotion recognition

Multi-modal emotion analysis from facial expressions and electroencephalogram.

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

FDR-MSA: Enhancing multimodal sentiment analysis through feature disentanglement and reconstruction

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Emotional Tagging Of Videos By Exploring Multiple Emotions' Coexistence

Mutilmodal Feature Extraction and Attention-based Fusion for Emotion Estimation in Videos

Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module

Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Enhancing Multimodal Emotional Information Extraction in Film and Television through Adaptive Feature Fusion with DenseNe, Transformer, and 3D CNN Models

A Multimodal Sentiment Analysis Method Integrating Multi-Layer Attention Interaction and Multi-Feature Enhancement

Integrative Sentiment Analysis: Leveraging Audio, Visual, and Textual Data

Design and Efficacy of a Data Lake Architecture for Multimodal Emotion Feature Extraction in Social Media

Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition