Learning Salient Features for Speech Emotion Recognition Using CNN

Jiamu Liu, Wenjing Han, Huabin Ruan, Xiaomin Chen, Dongmei Jiang, Haifeng Li
DOI: https://doi.org/10.1109/aciiasia.2018.8470393
2018-01-01
Abstract: In this work, a framework based on a Convolutional Neural Network (CNN) is proposed for speech emotion recognition (SER). We focus on extracting the most salient frames from the entire frame sequence via the proposed CNN structure to represent the utterance. To this end, a pooling method named global k-max pooling is used in our CNN structure (GCNN). We conducted SER experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and compared the results with those of several other CNN structures to validate the proposed framework. The experimental results show that GCNN outperforms the other CNN models. In addition, experiments were conducted to explore how many key frames GCNN should output to capture the salient emotional information. The results indicate that a representation of limited length is more appropriate, while an overly long representation is likely to contain redundant information that degrades the performance of the model.
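The abstract names global k-max pooling but does not spell out how it works. The sketch below shows one common reading of the idea, assuming that for each feature map the k largest activations along the time axis are kept in their original temporal order. The function name `global_k_max_pooling`, the input layout `feature_maps[channel][frame]`, and the order-preserving behaviour are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def global_k_max_pooling(feature_maps: List[List[float]], k: int) -> List[List[float]]:
    """Keep the k largest activations of each channel along the time axis.

    Assumed layout (hypothetical): feature_maps[c][t] is the activation of
    channel c at frame t. The kept values stay in temporal order.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    pooled = []
    for channel in feature_maps:
        if k > len(channel):
            raise ValueError("k cannot exceed the number of frames")
        # Frame indices of the k largest activations in this channel.
        top_frames = sorted(range(len(channel)),
                            key=lambda t: channel[t],
                            reverse=True)[:k]
        # Restore temporal order so the output is still a (short) sequence.
        top_frames.sort()
        pooled.append([channel[t] for t in top_frames])
    return pooled


# Two channels, four frames each; keep the two most salient frames per channel.
example = [[0.1, 0.9, 0.3, 0.7],
           [0.5, 0.2, 0.8, 0.4]]
print(global_k_max_pooling(example, k=2))  # [[0.9, 0.7], [0.5, 0.8]]
```

Whatever the utterance length, the output has exactly k values per channel, which is what allows a fixed-length representation to feed a classifier. The choice of k corresponds to the "how many key frames" question raised in the abstract.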