Abstract: Continuous affective state estimation from facial information requires predicting a time series of emotional states from a facial image sequence. Modeling the spatio-temporal evolution of facial information plays an important role in affective state estimation. Recurrent Neural Networks (RNNs) are among the most widely used methods for this purpose: they provide an attractive framework for propagating information over a sequence using a continuous-valued hidden layer representation. In this work, we propose to instead learn rich affective state dynamics. We model human affect as a dynamical system and define the affective state in terms of valence, arousal and their higher-order derivatives. We then pose the affective state estimation problem as a jointly trained state estimator for high-dimensional input images, combining an RNN and a Bayesian filter, specifically Kalman filters (KF) and Extended Kalman filters (EKF), so that all weights in the resulting network can be trained using backpropagation. We use a recently proposed general framework for designing and learning discriminative state estimators framed as computational graphs. Such an approach can handle high-dimensional observations and efficiently optimize the state estimator in an end-to-end fashion. In addition, to deal with the asynchrony between emotion labels and input images caused by the inherent reaction lag of the annotators, we introduce a convolutional layer that aligns features with emotion labels. Experimental results on the RECOLA and SEMAINE datasets for continuous emotion prediction illustrate the potential of the proposed framework compared to recent state-of-the-art models.
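To make the state definition concrete, the following is a minimal sketch of a Kalman filter whose state holds one affect dimension (e.g. valence) and its first derivative. It is not the model described in the abstract: there the filter is trained jointly with an RNN by backpropagation, whereas here the constant-velocity dynamics, the noise variances `q` and `r`, and the function name `kalman_constant_velocity` are all illustrative assumptions, and the scalar observations stand in for per-frame predictions from some upstream network.

```python
from typing import List, Sequence, Tuple


def kalman_constant_velocity(
    observations: Sequence[float],
    dt: float = 1.0,
    q: float = 1e-3,  # assumed process-noise variance (hand-picked, not learned)
    r: float = 1e-1,  # assumed observation-noise variance (hand-picked, not learned)
) -> List[Tuple[float, float]]:
    """Filter a scalar series with state [value, rate]; only value is observed.

    Returns one (value, rate) estimate per observation.
    """
    if not observations:
        return []

    # Initial state: first observation, zero rate, unit covariance (assumption).
    x0, x1 = float(observations[0]), 0.0
    p00, p01, p11 = 1.0, 0.0, 1.0

    estimates = []
    for z in observations:
        # Predict with F = [[1, dt], [0, 1]]: x <- F x, P <- F P F^T + Q.
        x0 = x0 + dt * x1
        p00 = p00 + dt * (2.0 * p01 + dt * p11) + q
        p01 = p01 + dt * p11
        p11 = p11 + q

        # Update with H = [1, 0]: the innovation is scalar, so no matrix inverse.
        s = p00 + r                  # innovation variance
        k0, k1 = p00 / s, p01 / s    # Kalman gain
        y = z - x0                   # innovation
        x0 += k0 * y
        x1 += k1 * y

        # P <- (I - K H) P, using the pre-update p01 for the p11 term.
        p11 = p11 - k1 * p01
        p01 = (1.0 - k0) * p01
        p00 = (1.0 - k0) * p00

        estimates.append((x0, x1))
    return estimates


if __name__ == "__main__":
    # Hypothetical noisy per-frame valence predictions.
    noisy = [0.10, 0.14, 0.09, 0.20, 0.22, 0.19, 0.30, 0.28]
    for value, rate in kalman_constant_velocity(noisy):
        print(f"value={value:.3f} rate={rate:.3f}")
```

In a learned variant, the transition and noise parameters would be trainable and the observation would come from the RNN, so that gradients flow through the predict and update steps; a second state block for arousal and further derivative terms would extend the state in the same way.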

Prediction-based Learning for Continuous Emotion Recognition in Speech

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network

Implementing machine learning techniques for continuous emotion prediction from uniformly segmented voice recordings

Learning Long-Term Temporal Contexts Using Skip RNN for Continuous Emotion Recognition

Continuous Emotion Ambiguity Prediction: Modeling with Beta Distributions

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

End-to-End Continuous Emotion Recognition from Video Using 3D ConvLSTM Networks

Continuous Affect Prediction Using Eye Gaze and Speech

Time-Delay Neural Network for Continuous Emotional Dimension Prediction from Facial Expression Sequences

Efficient Modeling of Long Temporal Contexts for Continuous Emotion Recognition

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Continuous Emotion Recognition via Deep Convolutional Autoencoder and Support Vector Regressor

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos

A Bayesian Filtering Framework for Continuous Affect Recognition from Facial Images

Speech, Head, and Eye-based Cues for Continuous Affect Prediction

Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks

A Deep Bidirectional Long Short-Term Memory Based Multi-Scale Approach for Music Dynamic Emotion Prediction

Multimodal Continuous Prediction of Emotions in Movies using Long Short-Term Memory Networks