A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition
Yang Liu, Haoqin Sun, Wenbo Guan, Yuqi Xia, Yongwei Li, Masashi Unoki, Zhen Zhao
DOI: https://doi.org/10.1109/taslp.2023.3245401
2023-01-01
Abstract: Speech emotion recognition models still do not achieve satisfactory performance because of the complexity of emotions. A common problem in most previous studies is that certain emotions are severely misclassified. In this article, we propose a novel framework that integrates a cascaded attention network with an adversarial joint loss strategy for speech emotion recognition, aiming to resolve these confusions by placing more emphasis on the emotions that are difficult to classify correctly. First, we extract log-Mels together with their deltas and delta-deltas as 3D features to effectively reduce the interference of external factors. Next, we introduce a cascaded attention network to extract effective emotional features, in which spatiotemporal attention selectively locates the targeted emotional regions in the input features. Within these targeted regions, self-attention with head fusion captures the long-distance dependencies of temporal features. Finally, we propose an adversarial joint loss strategy that distinguishes highly similar emotional embeddings by means of hard triplets generated in an adversarial fashion. To evaluate the proposed method, we perform experiments on the IEMOCAP, CASIA, and EMODB corpora. The experimental results demonstrate that the proposed method significantly outperforms state-of-the-art approaches on all datasets.
engineering, electrical & electronic,acoustics
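
The 3D input features mentioned in the abstract (log-Mels with their deltas and delta-deltas) can be illustrated with a minimal sketch. The abstract does not state how the deltas are computed, so the standard regression formula with replicated edge frames is assumed here. The function names `delta` and `stack_3d` and the `width` parameter are hypothetical, and the log-Mel matrix is assumed to be already computed as a time-by-Mel-band list of rows.

```python
def delta(frames, width=2):
    """Regression-based delta along time (assumed formula, not from the paper).

    frames: list of rows, one row per time frame, one value per Mel band.
    d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..width} n^2)
    Frames beyond the edges are replaced by the first or last frame.
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, n_frames - 1)][k]
                prv = frames[max(t - n, 0)][k]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_3d(log_mel, width=2):
    """Stack static, delta, and delta-delta features as 3 channels.

    Returns a nested list shaped (3, time, mel).
    """
    d1 = delta(log_mel, width)
    d2 = delta(d1, width)
    return [log_mel, d1, d2]


# Toy input: 10 frames, 2 Mel bands, values rising linearly over time.
toy_log_mel = [[float(t), 2.0 * t] for t in range(10)]
features = stack_3d(toy_log_mel)
```

On a linear ramp like the toy input, the delta channel away from the edges equals the slope of each band and the delta-delta channel is zero, which gives a quick sanity check of the formula before applying it to real log-Mel spectrograms.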