Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment

Ohad Cohen, Gershon Hazan, Sharon Gannot
2024-09-18
Abstract: This paper presents a Multi-modal Emotion Recognition (MER) system designed to enhance emotion recognition accuracy in challenging acoustic conditions. Our approach combines a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio processing with an R(2+1)D Convolutional Neural Network (CNN) model for video analysis. We evaluate our proposed method on a reverberated version of the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset using synthetic and real-world Room Impulse Responses (RIRs). Our results demonstrate that integrating audio and video modalities yields superior performance compared to uni-modal approaches, especially in challenging acoustic conditions. Moreover, we show that the multimodal (audiovisual) approach that utilizes multiple microphones outperforms its single-microphone counterpart.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the accuracy of multi-modal emotion recognition (MER) in reverberant and noisy environments. Most existing emotion recognition systems focus on a single modality (such as text, voice, or video), and in complex real-world environments these models often perform poorly because they do not exploit the multi-modal nature of emotional expression. In addition, few studies use multiple modalities and multiple microphones together to improve emotion recognition. To address these challenges, the authors propose a multi-modal emotion recognition system that combines multi-channel audio processing with video analysis. The system processes multi-channel audio signals with a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) and uses an R(2+1)D Convolutional Neural Network (CNN) model for video analysis. Experimental results show that, in reverberant conditions, the multi-modal, multi-microphone method outperforms its single-modal and single-microphone counterparts.

### Specific problem descriptions

1. **Limitations of single modality**: Most existing research focuses on single-modal emotion recognition, relying only on text, voice, or video. These methods perform well in some scenarios, but in complex environments their performance may decline because they do not exploit the multi-modal nature of emotional expression.
2. **Impact of reverberation and noise**: In the real world, acoustic conditions such as reverberation and noise can substantially degrade audio-based emotion recognition. How to handle these adverse conditions effectively, particularly with multi-channel audio, remains an open problem.
3. **Insufficient application of multi-modality and multiple microphones**: Although some studies have explored multi-modal emotion recognition, relatively few use multiple modalities and multiple microphones together, especially in reverberant and noisy environments.

### Solutions

To solve the above problems, the authors propose the following:

1. **Multi-modal fusion**: Combine information from the audio and video modalities to capture a more complete picture of emotional expression.
2. **Multi-channel audio processing**: Process multi-channel audio signals with a modified and extended HTS-AT architecture, using the information provided by multiple microphones to improve robustness to reverberation and noise.
3. **Deep learning model**: Use the R(2+1)D CNN model for video analysis, which captures spatio-temporal features in the video.
4. **Experimental verification**: Conduct experiments on a reverberated version of the RAVDESS dataset, using synthetic and real-world Room Impulse Responses (RIRs) to evaluate the proposed method.

With these methods, the authors demonstrate that combining multiple modalities and multiple microphones gives superior performance in complex acoustic conditions, especially in reverberant environments, where the multi-modal method outperforms the single-modal methods.
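
The experimental setup above depends on reverberating clean recordings with RIRs, one per microphone. As a rough illustration of that step only, the sketch below convolves a clean signal with several impulse responses to produce a multi-channel reverberant signal. The function names (`toy_rir`, `convolve`, `reverberate`) are hypothetical, and the exponentially decaying noise model for the RIR is a common textbook simplification assumed here for brevity; it is not the RIR generation procedure used in the paper.

```python
import math
import random


def toy_rir(rt60_s, fs_hz, length, seed):
    """Toy RIR (assumption): Gaussian noise under an exponential envelope
    whose amplitude drops by 60 dB (a factor of 1000) after rt60_s seconds."""
    rng = random.Random(seed)
    decay = math.log(1000.0) / (rt60_s * fs_hz)  # per-sample decay rate
    rir = [rng.gauss(0.0, 1.0) * math.exp(-decay * n) for n in range(length)]
    rir[0] = 1.0  # unit-gain direct path
    return rir


def convolve(signal, rir):
    """Full linear convolution; output length is len(signal) + len(rir) - 1."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def reverberate(signal, rirs):
    """Apply one RIR per microphone to a single clean source signal."""
    return [convolve(signal, rir) for rir in rirs]


fs = 16000                      # sampling rate in Hz (illustrative value)
clean = [1.0, 0.5, -0.25, 0.0]  # stand-in for a clean speech segment
rirs = [toy_rir(0.5, fs, 8, seed=m) for m in range(3)]  # three microphones
multichannel = reverberate(clean, rirs)
print(len(multichannel), len(multichannel[0]))
```

For real recordings, the nested loop would normally be replaced by an FFT-based routine such as `scipy.signal.fftconvolve`, and the toy RIRs by simulated or measured ones.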