Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Xiaoyu Tang, Yixin Lin, Ting Dang, Yuanfang Zhang, Jintao Cheng
2024-06-04
Abstract: Speech Emotion Recognition (SER) is crucial in human-machine interactions. Mainstream approaches utilize Convolutional Neural Networks or Recurrent Neural Networks to learn local energy feature representations of speech segments from speech information, but struggle with capturing global information such as the duration of energy in speech. Some use Transformers to capture global information, but there is room for improvement in terms of parameter count and performance. Furthermore, existing attention mechanisms focus on spatial or channel dimensions, hindering learning of important temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and capture temporal, spatial and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time-frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on IEMOCAP and Emo-DB datasets and show our approach significantly improves the performance over the state-of-the-art methods.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper attempts to address the issue in Speech Emotion Recognition (SER) where existing methods are insufficient in capturing both local and global information of speech signals. Specifically:

1. **Limitations of mainstream methods**: Current methods mainly use Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) to learn local energy feature representations from speech information, but these methods struggle to capture global information, such as the duration of energy in speech.
2. **Room for improvement in Transformer**: Although some studies use Transformers to capture global information, there is still room for improvement in terms of the number of parameters and performance.
3. **Limitations of attention mechanisms**: Existing attention mechanisms mainly focus on spatial or channel dimensions and fail to effectively capture important temporal information in speech.

To overcome the above issues, the paper proposes a speech emotion recognition network based on CNN-Transformer and multi-dimensional attention mechanisms, aiming to model local and global information of speech at different granularities and capture temporal, spatial, and channel dependencies. The specific contributions include:

- **Framework design**: A framework based on CNN and Transformer is proposed, which extracts initial local features of speech through time-frequency domain convolution and stacked convolutional blocks, and enhances local and global features through stacked CNN and Transformer blocks.
- **Temporal-Channel-Spatial Attention Mechanism (T-Sa)**: A new temporal-channel-spatial attention mechanism is introduced, which models temporal information through bidirectional LSTM and efficiently integrates attention in spatial and channel dimensions through Shuffle units.
- **Lightweight Convolutional Transformer (LCT) module**: A module combining depthwise separable convolution and lightweight Transformer is proposed, which can efficiently extract local information and capture long-distance dependencies between features.

Through these innovations, the paper aims to improve the performance of speech emotion recognition, especially in handling complex speech signals. Experimental results show that the proposed method significantly outperforms existing methods on the IEMOCAP and Emo-DB datasets.
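
The pairing of large kernels with depthwise separable convolutions in the LCT module can be illustrated with a weight count. The sketch below is not the paper's implementation; it only applies the standard formulas for a regular k x k convolution and for a depthwise convolution followed by a 1 x 1 pointwise convolution. The function names and the channel and kernel sizes are hypothetical, chosen for illustration and not taken from the paper.

```python
def conv2d_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Number of weights in a standard k x k 2D convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Number of weights in a k x k depthwise conv plus a 1 x 1 pointwise conv."""
    # Depthwise step: one k x k filter per input channel.
    depthwise = k * k * c_in + (c_in if bias else 0)
    # Pointwise step: a 1 x 1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical sizes for illustration only (not from the paper).
    channels, kernel = 64, 7
    standard = conv2d_params(channels, channels, kernel)
    separable = depthwise_separable_params(channels, channels, kernel)
    print(f"standard conv weights:            {standard}")
    print(f"depthwise separable conv weights: {separable}")
    print(f"reduction factor:                 {standard / separable:.1f}x")
```

Because the depthwise term grows with k² times the channel count rather than k² times the product of input and output channels, enlarging the kernel adds comparatively few weights, which is consistent with the summary's description of the module as lightweight.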