Speech Emotion Recognition Using Sequential Capsule Networks

Xixin Wu, Songxiang Liu, Yuewen Cao, Xu Li, Jianwei Yu, Dongyang Dai, Xi Ma, Shoukang Hu, Zhiyong Wu, Xunying Liu, Helen Meng
DOI: https://doi.org/10.1109/taslp.2021.3120586
2019-01-01
Abstract: Speech emotion recognition (SER) is an indispensable part of fluid human-machine interaction and has attracted considerable research attention. Recent work on SER has successfully applied convolutional neural networks (CNNs) to learn feature representations from speech spectrograms. However, a fundamental problem of CNNs is that the spatial information in spectrograms is lost, including the positional and relational information of low-level features such as pitch and formant frequencies. We propose a novel architecture of sequential capsule networks (CapNets) that leverages a key advantage of CapNets: spatial information can be preserved in capsules and passed to upper capsule layers via dynamic routing. The dynamic routing algorithm also provides an effective alternative to pooling or storing recurrent hidden states for obtaining utterance-level features from the sequential capsule outputs. To further improve the model's ability to capture contextual information, we introduce a recurrent connection into the sequential structure. An experimental comparison of the proposed systems with previously published systems using CNNs and recurrent neural networks (RNNs) on the IEMOCAP corpus demonstrates the effectiveness of the proposed sequential CapNets.
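As a rough illustration of how dynamic routing can aggregate a sequence of capsule outputs into utterance-level capsules without pooling, the following is a minimal pure-Python sketch of generic routing-by-agreement. It is not the authors' implementation: the function names, the `u_hat` layout (one prediction vector per input capsule and output capsule), the default of three routing iterations, and the toy numbers are all assumptions made for illustration.

```python
import math

def squash(s, eps=1e-9):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq + eps)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, num_iters=3):
    """Route input capsules to output capsules by agreement.

    u_hat[i][j] is the prediction vector (list of floats) that input
    capsule i makes for output capsule j. Returns the output capsule
    vectors and the coupling coefficients used in the last iteration.
    """
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]  # routing logits, start uniform
    v, c = [], []
    for _ in range(num_iters):
        # Each input capsule distributes its output over the upper capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            # Weighted sum of predictions, then squash to get the output capsule.
            s_j = [sum(c[i][j] * u_hat[i][j][k] for i in range(num_in))
                   for k in range(dim)]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the output capsule.
        for i in range(num_in):
            for j in range(num_out):
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v, c

# Toy input: 3 time-step capsules, 2 utterance-level capsules, 2-dim vectors.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[0.9, 0.1], [0.1, 0.0]],
    [[1.0, 0.1], [-0.1, 0.0]],
]
v, c = dynamic_routing(u_hat)
```

In this toy input the three time-step capsules make consistent predictions for the first output capsule and inconsistent ones for the second, so routing strengthens the first capsule. The length of each output vector can then be read as a confidence score, which is the role pooling or a final recurrent state would otherwise play.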