Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

Joe Dhanith P R, Shravan Venkatraman, Modigari Narendra, Vigya Sharma, Santhosh Malarvannan, Amir H. Gandomi
2024-08-15
Abstract: Understanding emotions is a fundamental aspect of human communication. Integrating audio and video signals offers a more comprehensive understanding of emotional states compared to traditional methods that rely on a single data source, such as speech or facial expressions. Despite its potential, multimodal emotion recognition faces significant challenges, particularly in synchronization, feature extraction, and fusion of diverse data sources. To address these issues, this paper introduces a novel transformer-based model named Audio-Video Transformer Fusion with Cross Attention (AVT-CA). The AVT-CA model employs a transformer fusion approach to effectively capture and synchronize interlinked features from both audio and video inputs, thereby resolving synchronization problems. Additionally, the Cross Attention mechanism within AVT-CA selectively extracts and emphasizes critical features while discarding irrelevant ones from both modalities, addressing feature extraction and fusion challenges. Extensive experimental analysis conducted on the CMU-MOSEI, RAVDESS and CREMA-D datasets demonstrates the efficacy of the proposed model. The results underscore the importance of AVT-CA in developing precise and reliable multimodal emotion recognition systems for practical applications.
Multimedia, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to address several key issues in multimodal emotion recognition:

1. **Synchronization Issue**: Audio and video data are not synchronized in time, which can lead to inaccurate emotion detection because the model may miss time-sensitive cues.
2. **Feature Extraction Issue**: Existing models often fail to extract features that sufficiently distinguish different emotional states, and they perform especially poorly on subtle or complex emotions.
3. **Fusion Technique Issue**: Current fusion methods (such as simple concatenation or early fusion) may not effectively integrate multimodal data, resulting in suboptimal performance.

To address these issues, the paper proposes a new Transformer-based model, Audio-Video Transformer Fusion with Cross Attention (AVT-CA). The model has three main components:

- **Feature Representation Module**: Introduces novel feature extraction strategies, including channel attention, spatial attention, and local feature extractors. These components collectively enhance the model's ability to extract intrinsic correlations from raw facial and vocal data and reduce the impact of inaccurate preprocessing.
- **Intermediate-Level Transformer Fusion**: After initial feature extraction, complementary information from the two modalities is fused through Transformer blocks, allowing important features to be learned at an early stage.
- **Cross Attention Mechanism**: Following intermediate-level Transformer fusion, the prediction is enhanced by selecting the most salient features from each modality, generating more robust and discriminative representations.

Experimental results on the CMU-MOSEI, RAVDESS, and CREMA-D datasets indicate that these methods improve the accuracy and reliability of multimodal emotion recognition systems.
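To make the cross attention idea concrete, the following is a minimal sketch of scaled dot-product cross attention between two modalities, written in plain Python. It is not the authors' implementation: the function names, the feature dimension, the sequence lengths, and the random inputs are all assumptions chosen for illustration, and the learned query/key/value projections that a real Transformer layer would apply are omitted. Queries come from one modality while keys and values come from the other, so each audio frame is re-expressed as a weighted mix of video frames, and vice versa.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality and
    keys/values from the other (no learned projections in this sketch)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query frame to every frame of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 5 audio frames and 3 video frames, 4 dims each.
random.seed(0)
dim = 4
audio = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
video = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

# Audio attends to video, and video attends to audio.
audio_attended, a2v_weights = cross_attention(audio, video, video)
video_attended, v2a_weights = cross_attention(video, audio, audio)

print(len(audio_attended), len(audio_attended[0]))
print(len(video_attended), len(video_attended[0]))
```

Because the attention weights are computed per query frame, the two sequences do not need the same length or a frame-by-frame alignment, which is one plausible reading of how attention-based fusion can soften the synchronization issue described above.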