Speech Emotion Recognition Method Based on Cross-Layer Intersectant Fusion

Kaiqiao Wang, Peng Liu, Songbin Li, Jingang Wang, Cheng Zhang
DOI: https://doi.org/10.1007/978-981-97-1280-9_21
2024-01-01
Abstract: Speech emotion recognition (SER) is a key technology in human-computer interaction (HCI) systems. Although existing neural methods have achieved satisfactory recognition accuracy, the lack of effective in-depth fusion of multi-scale features limits further improvement. In this paper, we address this issue from two aspects: extracting exhaustive features and fusing multi-scale features. In particular, we propose a recognition network based on Cross-Layer Intersectant Fusion, termed CLIF. It mainly consists of a multi-scale feature extraction module and a cross-layer intersectant fusion module. The former takes acoustic features as input and extracts feature maps with different receptive fields layer by layer through progressively deeper convolutional structures. Among these features, the lower levels retain more of the original information but also contain noise, while the higher levels carry emotional semantics that are easier to classify but lose the details of the original acoustic features. Therefore, we use the cross-layer intersectant fusion module to make efficient use of both low-level and high-level features. Experimental results demonstrate that the proposed CLIF outperforms existing state-of-the-art speech emotion recognition algorithms. CLIF achieves overall recognition accuracies of 82.17% and 93.26% on the IEMOCAP and CASIA datasets, respectively.
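The two-stage idea in the abstract (stacked convolutions that yield feature maps at several scales, followed by a fusion step in which every level exchanges information with every other level) can be illustrated with a minimal pure-Python sketch. This is not the CLIF architecture: the abstract does not specify the layers or the fusion operator, so the fixed smoothing kernel, the pooling factor, the nearest-neighbour resampling, and the averaging fusion below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
from typing import List

Signal = List[float]


def conv1d_same(x: Signal, kernel: Signal) -> Signal:
    """1D correlation with zero padding, so the output length equals len(x)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def relu(x: Signal) -> Signal:
    return [max(0.0, v) for v in x]


def avg_pool2(x: Signal) -> Signal:
    """Halve the temporal resolution by averaging adjacent pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def resample(x: Signal, length: int) -> Signal:
    """Nearest-neighbour resampling to a target length."""
    n = len(x)
    return [x[min(n - 1, i * n // length)] for i in range(length)]


def extract_multiscale(x: Signal, depth: int = 3) -> List[Signal]:
    """Stack conv + pool layers; each level sees a wider receptive field.

    Assumption: a fixed smoothing kernel stands in for learned weights.
    """
    kernel = [0.25, 0.5, 0.25]
    feats = []
    h = x
    for _ in range(depth):
        h = relu(conv1d_same(h, kernel))
        feats.append(h)      # keep this level's feature map
        h = avg_pool2(h)     # coarser input for the next level
    return feats


def cross_layer_fuse(feats: List[Signal]) -> List[Signal]:
    """Assumed fusion: align all levels to each level's length and average them."""
    fused = []
    for target in feats:
        aligned = [resample(f, len(target)) for f in feats]
        fused.append([sum(col) / len(col) for col in zip(*aligned)])
    return fused


if __name__ == "__main__":
    acoustic = [float(i % 4) for i in range(16)]  # toy 1D acoustic feature track
    levels = extract_multiscale(acoustic, depth=3)
    for level in cross_layer_fuse(levels):
        print(len(level), [round(v, 3) for v in level])
```

In this sketch each fused level keeps its own temporal resolution while mixing in both finer (low-level) and coarser (high-level) information, which mirrors the motivation stated in the abstract; a trained model would replace the averaging with learned fusion weights.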