Abstract: Speech emotion recognition plays a crucial role in human-machine interaction systems. Recently, various optimized Transformers have been successfully applied to speech emotion recognition. However, existing Transformer architectures focus mainly on global information and are computationally expensive, even though rich emotional representations exist locally in different parts of the input speech. To tackle these problems, we propose a Multi-Scale TRansformer (MSTR) for speech emotion recognition. It comprises three main components: (1) a multi-scale temporal feature operator, (2) a fractal self-attention module, and (3) a scale mixer module. Together, these components enhance the Transformer's ability to learn multi-scale local emotion representations. Experimental results demonstrate that the proposed MSTR model significantly outperforms a vanilla Transformer and other state-of-the-art methods across three speech emotion datasets: IEMOCAP, MELD, and CREMA-D. In addition, it greatly reduces the computational cost.

What problem does this paper attempt to address?

The paper addresses two limitations of existing Transformer architectures in speech emotion recognition: they focus mainly on global information at a high computational cost, and they fail to fully exploit local emotion representations at different time scales. Specifically:

1. **High computational cost**: The computational complexity of a standard full-attention Transformer grows quadratically with sequence length, which makes long sequences difficult to process on mobile or embedded devices.
2. **Insufficient local emotion representation**: Human emotion in speech is expressed at multiple granularities; different emotional cues may appear in different parts of an utterance and over different time spans. Existing methods often fail to capture these multi-scale local features.

To solve these problems, the authors propose the Multi-Scale Temporal Transformer (MSTR), which aims to enhance the Transformer's ability to learn multi-scale local emotion representations through three main components:

1. **Multi-Scale Temporal Feature Operator**: Extracts multi-scale feature representations in parallel from the original acoustic features or from the outputs of lower layers.
2. **Fractal Self-Attention Module**: Efficiently models the temporal relationships between frames within a fixed-length window.
3. **Scale Mixer Module**: Fuses features from different time scales into a unified emotion feature representation.

Experimental results show that MSTR outperforms a vanilla Transformer and other state-of-the-art methods on multiple speech emotion recognition datasets (IEMOCAP, MELD, and CREMA-D) while greatly reducing the computational cost.
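To make the cost argument concrete, the following is a minimal sketch that counts query-key score entries for full attention versus attention restricted to fixed-length windows, applied at several temporal scales. It is not the paper's implementation: the non-overlapping windows, the average pooling used to form coarser scales, and all function names are assumptions chosen for illustration only.

```python
import math
from typing import Dict, List


def full_attention_pairs(seq_len: int) -> int:
    """Score entries for full self-attention: every frame attends to every frame."""
    return seq_len * seq_len


def windowed_attention_pairs(seq_len: int, window: int) -> int:
    """Score entries when attention is limited to non-overlapping windows.

    Assumption: windows do not overlap, and a shorter final window is allowed.
    """
    n_full, remainder = divmod(seq_len, window)
    return n_full * window * window + remainder * remainder


def average_pool(frames: List[float], scale: int) -> List[float]:
    """Build a coarser time scale by averaging groups of `scale` frames.

    Assumption: average pooling stands in for the multi-scale operator.
    """
    pooled = []
    for start in range(0, len(frames), scale):
        group = frames[start:start + scale]
        pooled.append(sum(group) / len(group))
    return pooled


def multi_scale_pair_counts(seq_len: int, scales: List[int], window: int) -> Dict[int, int]:
    """Windowed-attention score entries at each temporal scale."""
    counts = {}
    for scale in scales:
        pooled_len = math.ceil(seq_len / scale)  # length after pooling by `scale`
        counts[scale] = windowed_attention_pairs(pooled_len, window)
    return counts


if __name__ == "__main__":
    seq_len, window = 1000, 10
    print("full attention:", full_attention_pairs(seq_len))
    print("windowed attention:", windowed_attention_pairs(seq_len, window))
    print("per-scale counts:", multi_scale_pair_counts(seq_len, [1, 2, 4], window))
```

For a 1,000-frame input and a window of 10 frames, full attention needs 1,000,000 score entries, while windowed attention needs 10,000. Under these assumptions, even summing over three scales (1, 2, and 4) gives 17,500 entries, because the windowed cost grows linearly with sequence length for a fixed window size.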

Multi-Scale Temporal Transformer For Speech Emotion Recognition
