Multi-Scale Temporal Transformer For Speech Emotion Recognition

Zhipeng Li, Xiaofen Xing, Yuanbo Fang, Weibin Zhang, Hengsheng Fan, Xiangmin Xu
2024-10-01
Abstract: Speech emotion recognition plays a crucial role in human-machine interaction systems. Recently, various optimized Transformers have been successfully applied to speech emotion recognition. However, existing Transformer architectures focus more on global information and require substantial computation. On the other hand, abundant speech emotional representations exist locally in different parts of the input speech. To tackle these problems, we propose a Multi-Scale Transformer (MSTR) for speech emotion recognition. It comprises three main components: (1) a multi-scale temporal feature operator, (2) a fractal self-attention module, and (3) a scale mixer module. These three components effectively enhance the Transformer's ability to learn multi-scale local emotion representations. Experimental results demonstrate that the proposed MSTR model significantly outperforms a vanilla Transformer and other state-of-the-art methods across three speech emotion datasets: IEMOCAP, MELD, and CREMA-D. In addition, it greatly reduces the computational cost.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two limitations of existing Transformer architectures for speech emotion recognition: they focus mainly on global information at a high computational cost, and they fail to fully exploit local emotion representations at different time scales. Specifically:

1. **High consumption of computing resources**: For long sequences, the computational complexity of a traditional full-attention Transformer grows quadratically with the sequence length, which makes it difficult to run on mobile or embedded devices.
2. **Insufficient local emotion representation**: The expression of human emotion in speech is multi-granular; that is, different emotion features may appear in different parts of the speech and over different time spans. Existing methods often cannot capture these multi-scale local emotion features effectively.

To solve these problems, the authors propose the Multi-Scale Temporal Transformer (MSTR), which aims to enhance the Transformer's ability to learn multi-scale local emotion representations through three main components:

1. **Multi-Scale Temporal Feature Operator**: Extracts multi-scale feature representations in parallel from the original acoustic features or from the outputs of lower layers.
2. **Fractal Self-Attention Module**: Efficiently models the temporal relationships between frames within a fixed-length window.
3. **Scale Mixer Module**: Fuses features from different time scales into a unified emotion feature representation.

With these designs, the MSTR model not only significantly improves performance on multiple speech emotion recognition datasets (such as IEMOCAP, MELD, and CREMA-D) but also greatly reduces the computing cost.
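The idea of modeling frame relationships only inside a fixed-length window can be illustrated with a small pure-Python sketch. This is not the paper's fractal self-attention module: it is plain scaled dot-product attention restricted to non-overlapping windows, with the simplifying assumption that queries, keys, and values are the raw frames themselves (no learned projections). The function names, window size, and toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frames):
    """Scaled dot-product self-attention with Q = K = V = frames.

    Assumption: no learned projections, to keep the sketch short.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in frames:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in frames]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, frames)) for d in range(dim)])
    return out


def windowed_attention(frames, window):
    """Apply self-attention independently inside each non-overlapping window."""
    out = []
    for start in range(0, len(frames), window):
        out.extend(attend(frames[start:start + window]))
    return out


# Toy input: 8 frames with 2 features each (hypothetical values).
frames = [[float(t), 1.0] for t in range(8)]
local = windowed_attention(frames, window=4)
```

Because each output frame is a weighted average of frames from its own window only, information does not cross window boundaries in this sketch; combining several window lengths is what would give a model access to multiple time scales.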
Experimental results show that the MSTR model outperforms a vanilla Transformer and other state-of-the-art methods while greatly reducing computational complexity.
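To make the complexity argument concrete, the following sketch counts query-key score computations for full attention versus attention restricted to non-overlapping fixed-length windows. The sequence length and window size are hypothetical and are not taken from the paper; the count ignores projections, feed-forward layers, and any cross-scale mixing, so it only indicates the scaling trend.

```python
def full_attention_pairs(num_frames):
    """Query-key pairs scored by full self-attention: quadratic in length."""
    return num_frames * num_frames


def windowed_attention_pairs(num_frames, window):
    """Query-key pairs scored when attention is limited to non-overlapping windows."""
    full_windows, remainder = divmod(num_frames, window)
    return full_windows * window * window + remainder * remainder


# Hypothetical example: 1000 frames, window of 50 frames.
n, w = 1000, 50
full = full_attention_pairs(n)            # 1000000
local = windowed_attention_pairs(n, w)    # 50000
print(full, local, full / local)          # 1000000 50000 20.0
```

For a fixed window size, the windowed count grows linearly with the number of frames, whereas the full count grows quadratically.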