Abstract: Emotion recognition through facial expression and non-verbal speech is an important area of affective computing. Both modalities have been extensively studied, from classical feature extraction techniques to more recent deep learning approaches. However, most of these approaches face two major challenges: (1) robustness: in the face of degradation such as noise, can a model still make correct predictions? and (2) cross-dataset generalisation: when a model is trained on one dataset, can it be used to make inferences on another dataset? To address these challenges directly, we first propose applying a spiking neural network (SNN) to predict emotional states from facial expression and speech data, and then investigate and compare accuracy when facing data degradation or unseen new input. We evaluate our approach on third-party, publicly available datasets and compare it with state-of-the-art techniques. Our approach demonstrates robustness to noise: it achieves an accuracy of 56.2% for facial expression recognition (FER), compared to 22.64% for CNN and 14.10% for SVM, when input images are degraded with a noise intensity of 0.5, and the highest accuracy of 74.3% for speech emotion recognition (SER), compared to 21.95% for CNN and 14.75% for SVM, when white noise is applied to the audio. For generalisation, our approach achieves consistently high accuracy of 89% for FER and 70% for SER in cross-dataset evaluation, suggesting that it learns more effective feature representations, which lead to good generalisation of facial features and vocal characteristics across subjects.
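The robustness protocol described in the abstract (degrade the test input at a given noise intensity, then measure accuracy) can be sketched as follows. This is a minimal illustration only: the abstract does not specify the noise model, so zero-mean Gaussian noise with standard deviation equal to the intensity is an assumption, and the function names `add_gaussian_noise`, `accuracy`, and `robustness_curve`, as well as the toy brightness classifier, are hypothetical and not taken from the paper.

```python
import random
from typing import Callable, Dict, List, Sequence

Signal = List[float]  # a flattened image or audio frame, values in [0, 1]


def add_gaussian_noise(signal: Sequence[float], intensity: float,
                       rng: random.Random) -> Signal:
    """Add zero-mean Gaussian noise (std = intensity) and clip to [0, 1].

    Assumption: the paper's "noise intensity" is modelled here as the
    standard deviation of additive Gaussian noise.
    """
    return [min(1.0, max(0.0, x + rng.gauss(0.0, intensity))) for x in signal]


def accuracy(predict: Callable[[Signal], int],
             inputs: Sequence[Signal], labels: Sequence[int]) -> float:
    """Fraction of inputs whose predicted label matches the true label."""
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    return correct / len(labels)


def robustness_curve(predict: Callable[[Signal], int],
                     inputs: Sequence[Signal], labels: Sequence[int],
                     intensities: Sequence[float],
                     seed: int = 0) -> Dict[float, float]:
    """Evaluate a fixed, already-trained classifier at each noise intensity."""
    rng = random.Random(seed)
    curve = {}
    for intensity in intensities:
        noisy = [add_gaussian_noise(x, intensity, rng) for x in inputs]
        curve[intensity] = accuracy(predict, noisy, labels)
    return curve


# Toy stand-in for a trained model: classify by mean brightness.
def toy_predict(signal: Signal) -> int:
    return 1 if sum(signal) / len(signal) > 0.5 else 0


# Toy test set: "dark" samples are class 0, "bright" samples are class 1.
inputs = [[0.1] * 16 for _ in range(20)] + [[0.9] * 16 for _ in range(20)]
labels = [0] * 20 + [1] * 20

curve = robustness_curve(toy_predict, inputs, labels, [0.0, 0.25, 0.5])
for intensity, acc in curve.items():
    print(f"noise intensity {intensity:.2f}: accuracy {acc:.2%}")
```

The same `accuracy` helper would cover the cross-dataset setting by passing it a model trained on one dataset together with the inputs and labels of a different dataset.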

A Comparison of Expressive Speech Synthesis Approaches based on Neural Network

Emotional Statistical Parametric Speech Synthesis Using LSTM-RNNs

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Linear Networks Based Speaker Adaptation for Speech Synthesis

Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin

Expressive Speech Driven Talking Avatar Synthesis with DBLSTM Using Limited Amount of Emotional Bimodal Data

Generalisation and Robustness Investigation for Facial and Speech Emotion Recognition Using Bio-Inspired Spiking Neural Networks

Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

A New Network Structure for Speech Emotion Recognition Research

Research on Chinese Speech Emotion Recognition Based on Deep Neural Network and Acoustic Features

Emphatic Speech Generation with Conditioned Input Layer and Bidirectional LSTMs for Expressive Speech Synthesis

Feature Based Adaptation for Speaking Style Synthesis

An Experimental Study on Speech Enhancement Based on Deep Neural Networks

Transfer Learning Based Progressive Neural Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis

Improving Deep Neural Network Based Speech Synthesis Through Contextual Feature Parametrization and Multi-Task Learning

Accounting for Variations in Speech Emotion Recognition with Nonparametric Hierarchical Neural Network

Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis

Inferring Emotion from Conversational Voice Data: A Semi-Supervised Multi-Path Generative Neural Network Approach