Abstract: Artificial Intelligence (AI)-based Speech Emotion Recognition (SER) has rapidly gained traction in consumer technology and is increasingly integrated into smart home systems, where recognition accuracy strongly affects user experience. Selecting suitable features remains difficult, however, because speech features vary with emotional nuance. Existing research concentrates mainly on local speech characteristics and neglects the broader contextual cues in speech signals, which contributes to relatively low emotion recognition accuracy in smart home systems. To address this, this paper introduces an enhanced SER approach named TF-Mix, which improves emotion prediction from speech by combining audio data augmentation with multiple features. To improve the model's adaptability, TF-Mix combines several feature extraction techniques: Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and the Transformer architecture. Combining these techniques yields three distinct architectures. The first is a 1-dimensional CNN followed by a Fully Connected Network (FCN). The other two, BiLSTM-FCN and BiLSTM-Transformer-FCN, retain their respective structures while incorporating CNNs. The individual models are then combined by weighted averaging into an ensemble model, designated D, which further improves emotion recognition. Experiments show strong performance for all four models on the SER task. The ensemble Model D achieves accuracies of 87.513% on RAVDESS, 86.233% on SAVEE, 99.857% on TESS, 82.295% on CREMA-D, and 97.546% on the TOTAL dataset.
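The weighted-averaging step that forms the ensemble can be illustrated with a minimal sketch. The abstract does not state the weights, the number of emotion classes, or how the weights are chosen, so the function names, the example probabilities, and the weights `[2, 1, 1]` below are hypothetical and serve only to show the general technique of fusing per-model class probabilities.

```python
def weighted_average_ensemble(model_probs, weights):
    """Fuse class probabilities from several models by weighted averaging.

    model_probs: list of per-model outputs, each a list of rows
                 (one row of class probabilities per utterance).
    weights:     one non-negative weight per model (hypothetical values;
                 the abstract does not say how they are set).
    """
    if len(model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and not all zero")
    norm = [w / total for w in weights]  # normalize so the weights sum to 1

    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    fused = [[0.0] * n_classes for _ in range(n_samples)]
    for w, probs in zip(norm, model_probs):
        for i, row in enumerate(probs):
            for j, p in enumerate(row):
                fused[i][j] += w * p
    return fused


def predict(fused):
    """Return the index of the highest fused probability for each utterance."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Illustrative outputs of three base models for two utterances and three
# emotion classes (made-up numbers, not results from the paper).
probs_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
probs_b = [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
probs_c = [[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]]

fused = weighted_average_ensemble([probs_a, probs_b, probs_c], [2, 1, 1])
labels = predict(fused)
print(labels)
```

In this sketch the second utterance is assigned class 2 even though the most heavily weighted model prefers class 1, which shows how the lower-weighted models can still change the fused decision. In practice the weights would typically be tuned on held-out validation data.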
