Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance
Gang Chen, Shiqing Zhang, Xin Tao, Xiaoming Zhao
DOI: https://doi.org/10.1109/access.2020.3038493
IF: 3.9
2020-01-01
IEEE Access
Abstract: In speech emotion recognition (SER), existing emotional datasets generally have an unbalanced distribution of emotional samples. Moreover, different segments of an utterance contribute unequally to SER. To address these two issues, this paper proposes a new SER method that combines a unified first-order attention network with data balance. The proposed method first applies a data balance strategy to augment and balance the training data. Then, a pre-trained convolutional neural network (CNN) model (i.e., VGGish) is fine-tuned on the target emotional datasets to learn segment-level speech features from the extracted log Mel-spectrograms. Next, the unified first-order attention mechanism, which includes different feature-pooling strategies such as sum, min, max, mean, and standard deviation (std), is embedded into the output of a bi-directional long short-term memory (Bi-LSTM) network. This mechanism learns high-level discriminative segment-level features and simultaneously aggregates them into fixed-length utterance-level features for SER. Finally, based on the utterance-level features, the softmax layer of the Bi-LSTM network performs the final emotion classification. Extensive experiments on three public datasets (BAUM-1s, AFEW5.0, and CHEAVD2.0) demonstrate the advantage of the proposed method.
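The aggregation step described in the abstract, turning a variable number of segment-level features into one fixed-length utterance-level vector, can be illustrated with a minimal NumPy sketch. The abstract does not give the attention formula, so everything here is an assumption made for illustration: the function name `attention_pool`, the single scoring vector `w`, and the choice to apply each pooling statistic to the attention-weighted segment features. It is not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(segments, w, mode="sum"):
    """Aggregate segment-level features into one utterance-level vector.

    segments: array of shape (T, D), e.g. Bi-LSTM outputs for T segments.
    w:        array of shape (D,), a hypothetical learned scoring vector.
    mode:     one of "sum", "min", "max", "mean", "std".
    """
    scores = segments @ w                  # one scalar score per segment, shape (T,)
    alpha = softmax(scores)                # attention weights, sum to 1
    weighted = segments * alpha[:, None]   # scale each segment by its weight
    poolers = {
        "sum": np.sum,
        "min": np.min,
        "max": np.max,
        "mean": np.mean,
        "std": np.std,
    }
    # Pooling over the time axis removes the dependence on T.
    return poolers[mode](weighted, axis=0)

# Two utterances of different length map to vectors of the same size.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
short_utt = attention_pool(rng.normal(size=(3, 4)), w, mode="max")
long_utt = attention_pool(rng.normal(size=(9, 4)), w, mode="max")
print(short_utt.shape, long_utt.shape)
```

In this sketch, a zero scoring vector gives uniform attention weights, so the "sum" mode reduces to plain mean pooling over segments. In a trained model `w` would be learned jointly with the rest of the network.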