Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Xiaoyu Tang, Yixin Lin, Ting Dang, Yuanfang Zhang, Jintao Cheng

2024-06-04

Abstract: Speech Emotion Recognition (SER) is crucial in human-machine interactions. Mainstream approaches utilize Convolutional Neural Networks or Recurrent Neural Networks to learn local energy feature representations of speech segments from speech information, but struggle with capturing global information such as the duration of energy in speech. Some use Transformers to capture global information, but there is room for improvement in terms of parameter count and performance. Furthermore, existing attention mechanisms focus on spatial or channel dimensions, hindering learning of important temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and capture temporal, spatial and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time-frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on IEMOCAP and Emo-DB datasets and show our approach significantly improves the performance over the state-of-the-art methods.
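The abstract pairs large convolutional kernels with depthwise separable convolutions and names parameter count as a concern. The arithmetic behind that pairing can be sketched as follows. The channel width (64) and kernel sizes (3, 7, 11) are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical.

```python
def conv2d_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Assumed, illustrative sizes; the paper's actual configuration may differ.
    c_in = c_out = 64
    for k in (3, 7, 11):
        std = conv2d_params(c_in, c_out, k)
        dws = depthwise_separable_params(c_in, c_out, k)
        print(f"k={k:2d}  standard={std}  separable={dws}")
```

In the standard convolution, the weight count grows with the product of channel widths and kernel area. In the separable form, the kernel area multiplies only the input channel count, which is why enlarging the kernel stays comparatively cheap.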

Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses a shortcoming of existing Speech Emotion Recognition (SER) methods: they do not adequately capture both local and global information in speech signals. Specifically:

1. **Limitations of mainstream methods**: Current methods mainly use Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) to learn local energy feature representations from speech, but they struggle to capture global information, such as the duration of energy in speech.
2. **Room for improvement in Transformers**: Although some studies use Transformers to capture global information, there is still room for improvement in both parameter count and performance.
3. **Limitations of attention mechanisms**: Existing attention mechanisms mainly focus on spatial or channel dimensions and fail to capture important temporal information in speech.

To overcome these issues, the paper proposes a speech emotion recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. The network aims to model local and global information of speech at different granularities and to capture temporal, spatial, and channel dependencies. The specific contributions include:

- **Framework design**: A framework based on CNN and Transformer is proposed. It extracts initial local features of speech through time-frequency domain convolution and stacked convolutional blocks, and enhances local and global features through stacked CNN and Transformer blocks.
- **Temporal-Channel-Spatial Attention Mechanism (T-Sa)**: A new temporal-channel-spatial attention mechanism is introduced. It models temporal information through a bidirectional LSTM and integrates attention in the spatial and channel dimensions through Shuffle units.
- **Lightweight Convolutional Transformer (LCT) module**: A module combining depthwise separable convolution and a lightweight Transformer is proposed. It extracts local information efficiently and captures long-distance dependencies between features.

Through these innovations, the paper aims to improve the performance of speech emotion recognition, especially in handling complex speech signals. Experimental results show that the proposed method significantly outperforms existing methods on the IEMOCAP and Emo-DB datasets.
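The summary mentions Shuffle units without describing them. As a rough illustration only, the sketch below shows the channel-shuffle step commonly used in ShuffleNet-style designs, which lets information flow between channel groups that were processed separately. Whether the paper's T-Sa module uses exactly this operation is an assumption, and `channel_shuffle` is a hypothetical helper that works on a flat list of channel labels rather than on real feature maps.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (ShuffleNet-style channel shuffle).

    `channels` is a flat list laid out group by group. The result is the same
    list viewed as a (groups, per_group) grid and read column by column.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


if __name__ == "__main__":
    # Six channels in two groups: [0, 1, 2] and [3, 4, 5].
    print(channel_shuffle(list(range(6)), groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each consecutive run of channels contains members from every original group, so a later grouped operation sees features from all groups.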

Self-attention Transfer Networks for Speech Emotion Recognition

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

CM-TCN: Channel-Aware Multi-scale Temporal Convolutional Networks for Speech Emotion Recognition

Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition

A Residual Multi-Scale Convolutional Transformer Network with Chunk-level Log-Mel Spectrograms for Speech Emotion Recognition

Learning multi-scale features for speech emotion recognition with connection attention mechanism

Speech Emotion Recognition with Hybrid Neural Network

Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition

3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition

Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer

Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation

Speech Emotion Recognition Using Sequential Capsule Networks

MS-SENet: Enhancing Speech Emotion Recognition Through Multi-scale Feature Fusion with Squeeze-and-Excitation Blocks

Learning Salient Features for Speech Emotion Recognition Using CNN

Speech Emotion Recognition Based on Improved Masking EMD and Convolutional Recurrent Neural Network

Speech Emotion Recognition Using Mel-Frequency Cepstral Coefficients & Convolutional Neural Networks