A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos

Congbao Shi, Yuanyuan Zhang, Baolin Liu
DOI: https://doi.org/10.1007/s10489-024-05329-w
IF: 5.3
2024-02-22
Applied Intelligence
Abstract: Continuous emotion recognition plays a crucial role in developing friendly and natural human-computer interaction applications. However, there exist two significant challenges unresolved in this field: how to effectively fuse complementary information from multiple modalities and capture long-range contextual dependencies during emotional evolution. In this paper, a novel multimodal continuous emotion recognition framework was proposed to address the above challenges. For the multimodal fusion challenge, the Multimodal Attention Fusion (MAF) method is proposed to fully utilize complementarity and redundancy between multiple modalities. To tackle temporal context dependencies, the Local Contextual Temporal Convolutional Network (LC-TCN) and the Global Contextual Temporal Convolutional Network (GC-TCN) were presented. These networks have the ability to progressively integrate multi-scale temporal contextual information from input streams of different modalities. Comprehensive experiments are conducted on the RECOLA and SEWA datasets to assess the effectiveness of our proposed framework. The experimental results demonstrate superior recognition performance compared to state-of-the-art approaches, achieving 0.834 and 0.671 on RECOLA, 0.573 and 0.533 on SEWA in terms of arousal and valence, respectively. These findings indicate a novel direction for continuous emotion recognition by exploring temporal multi-scale information.
computer science, artificial intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address two key challenges in continuous emotion recognition:

1. **Effective Fusion of Multimodal Information**: How to effectively fuse complementary information from different modalities (such as video and audio).
2. **Capturing Long-term Context Dependencies**: How to capture long-term context dependencies as an emotion evolves over time.

Neither issue has been fully resolved in the field, which limits the robustness and accuracy of emotion recognition systems and restricts their use in real-world scenarios. To address these problems, the authors propose a deep learning framework based on multimodal fusion and local-global contextual temporal convolutional networks.

### Specific Methods

1. **Multimodal Attention Fusion (MAF)**:
   - A new multimodal fusion method that combines intra-modal attention and inter-modal attention to exploit the complex nonlinear relationships between modalities and promote the dynamic interaction of emotional information.
2. **Local Contextual Temporal Convolutional Network (LC-TCN)** and **Global Contextual Temporal Convolutional Network (GC-TCN)**:
   - LC-TCN captures local multi-scale contextual information from each feature stream through parallel dilated convolution layers and channel attention mechanisms.
   - GC-TCN captures global multi-scale contextual information through dilated convolution layers and dense connections, and is used to predict the emotional dimension values (arousal and valence).

### Experimental Results

The authors conducted experiments on the RECOLA and SEWA datasets to verify the effectiveness of the proposed framework. The results show that the framework outperforms existing methods in predicting arousal and valence, achieving 0.834 (arousal) and 0.671 (valence) on RECOLA, and 0.573 (arousal) and 0.533 (valence) on SEWA.

### Main Contributions

1. **Multimodal Information Fusion**: A model-level fusion method based on attention mechanisms is proposed, which learns the complex nonlinear relationships between modalities and promotes the dynamic interaction of emotional information.
2. **Temporal Context Dependencies**: The local and global contextual temporal convolutional networks are proposed to progressively integrate multi-scale contextual information along the time axis.
3. **End-to-End Trainable Model Framework**: An end-to-end trainable framework that integrates multimodal data is constructed for frame-level emotion state prediction.

### Conclusion

The multimodal continuous emotion recognition framework proposed in this paper addresses the challenges of multimodal data fusion and temporal context modeling, improves on the reported performance of prior methods on RECOLA and SEWA, and points to multi-scale temporal information as a direction for further research in continuous emotion recognition.
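
The summary describes LC-TCN as using parallel dilated convolution layers to gather context at several temporal scales. The following is a minimal pure-Python sketch of that general idea only, not the authors' implementation. The function names, the two-tap averaging kernel, the dilation rates 1, 2 and 4, and the causal zero-padding are all assumptions chosen for illustration; the summary does not specify any of them, and channel attention is omitted.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Causal 1-D dilated convolution (assumed variant, for illustration).

    Output has the same length as the input; taps that would fall before
    the start of the sequence are treated as zero.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = t - i * dilation  # look back i * dilation frames
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out


def parallel_branches(signal, kernel, dilations):
    """Apply the same kernel at several dilation rates side by side.

    Each branch sees a different temporal scale of the same input stream.
    """
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


def receptive_field(kernel_size, dilations):
    """Number of input frames covered when dilated layers are stacked in sequence."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical per-frame feature values for one modality stream.
frames = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
branches = parallel_branches(frames, kernel=[0.5, 0.5], dilations=[1, 2, 4])
for rate, output in branches.items():
    print(rate, output)
print(receptive_field(2, [1, 2, 4]))
```

With a kernel of size 2, a branch with dilation 4 averages the current frame with the one four steps earlier, while a branch with dilation 1 averages adjacent frames. Stacking the three rates in sequence, rather than running them in parallel, would cover 8 frames, which is why dilation is a cheap way to widen temporal context.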