Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching
Shiqing Zhang,Shiliang Zhang,Tiejun Huang,Wen Gao
DOI: https://doi.org/10.1109/tmm.2017.2766843
IF: 7.3
2018-01-01
IEEE Transactions on Multimedia
Abstract:Speech emotion recognition is challenging because of the affective gap between the subjective emotions and low-level features. Integrating multilevel feature learning and model training, deep convolutional neural networks (DCNN) has exhibited remarkable success in bridging the semantic gap in visual tasks like image classification, object detection. This paper explores how to utilize a DCNN to bridge the affective gap in speech signals. To this end, we first extract three channels of log Mel-spectrograms (static, delta, and delta delta) similar to the red, green, blue (RGB) image representation as the DCNN input. Then, the AlexNet DCNN model pretrained on the large ImageNet dataset is employed to learn high-level feature representations on each segment divided from an utterance. The learned segment-level features are aggregated by a discriminant temporal pyramid matching (DTPM) strategy. DTPM combines temporal pyramid matching and optimal Lp-norm pooling to form a global utterance-level feature representation, followed by the linear support vector machines for emotion classification. Experimental results on four public datasets, that is, EMO-DB, RML, eNTERFACE05, and BAUM-1s, show the promising performance of our DCNN model and the DTPM strategy. Another interesting finding is that the DCNN model pretrained for image applications performs reasonably good in affective speech feature extraction. Further fine tuning on the target emotional speech datasets substantially promotes recognition performance.