Abstract: Continuous emotion recognition plays a crucial role in developing friendly and natural human-computer interaction applications. However, two significant challenges remain unresolved in this field: how to effectively fuse complementary information from multiple modalities, and how to capture long-range contextual dependencies during emotional evolution. In this paper, a novel multimodal continuous emotion recognition framework is proposed to address these challenges. For the multimodal fusion challenge, the Multimodal Attention Fusion (MAF) method is proposed to fully utilize complementarity and redundancy between multiple modalities. To tackle temporal context dependencies, the Local Contextual Temporal Convolutional Network (LC-TCN) and the Global Contextual Temporal Convolutional Network (GC-TCN) are presented. These networks progressively integrate multi-scale temporal contextual information from input streams of different modalities. Comprehensive experiments are conducted on the RECOLA and SEWA datasets to assess the effectiveness of the proposed framework. The experimental results demonstrate superior recognition performance compared to state-of-the-art approaches, achieving 0.834 and 0.671 on RECOLA, and 0.573 and 0.533 on SEWA, in terms of arousal and valence, respectively. These findings indicate a novel direction for continuous emotion recognition by exploring temporal multi-scale information.
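The abstract does not name the metric behind the arousal and valence scores. Continuous emotion recognition on RECOLA and SEWA is commonly scored with the concordance correlation coefficient (CCC), so the sketch below assumes that metric; this is an assumption, not something the text above confirms. The function name `concordance_cc` is hypothetical and the sample values are made up purely for illustration.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))^2)
    Assumption: this is the metric behind the reported scores; the source
    text does not say so explicitly.
    """
    if len(pred) != len(gold) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    # Population (1/n) variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2.0 * cov / denom


# Illustrative frame-level annotation trace (made-up values).
gold_trace = [0.1, 0.2, 0.3, 0.4]
# A prediction with the right shape but a constant offset is penalised,
# which plain Pearson correlation would not do.
shifted_trace = [g + 0.5 for g in gold_trace]
score_exact = concordance_cc(gold_trace, gold_trace)
score_shifted = concordance_cc(shifted_trace, gold_trace)
```

Under this assumption, a score near 1 means the predicted trace matches the annotation in both shape and scale, which is why CCC is often preferred over plain correlation for frame-level regression.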

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address two key challenges in continuous emotion recognition:

1. **Effective Fusion of Multimodal Information**: How to effectively fuse complementary information from different modalities (such as video, audio, etc.).
2. **Capturing Long-term Context Dependencies**: How to capture long-term context dependencies in the process of emotion evolution.

These two issues have not been fully resolved in the field of continuous emotion recognition, affecting the robustness and accuracy of emotion recognition systems and limiting their application in various real-world scenarios. To address these problems, the authors propose a deep learning framework based on multimodal fusion and local-global contextual temporal convolutional networks.

### Specific Methods

1. **Multimodal Attention Fusion (MAF)**:
   - A new multimodal fusion method is proposed, including intra-modal attention and inter-modal attention, to fully utilize the complex nonlinear relationships between different modalities and promote the dynamic interaction of emotional information.
2. **Local Contextual Temporal Convolutional Network (LC-TCN)** and **Global Contextual Temporal Convolutional Network (GC-TCN)**:
   - LC-TCN captures local multi-scale contextual information of each feature stream through parallel dilated convolution layers and channel attention mechanisms.
   - GC-TCN captures global multi-scale contextual information through dilated convolution layers and dense connections, used to predict emotional dimension values (such as arousal and valence).

### Experimental Results

The authors conducted experiments on the RECOLA and SEWA datasets to verify the effectiveness of the proposed framework. The experimental results show that the framework outperforms existing methods in predicting arousal and valence, achieving 0.834 and 0.671 on the RECOLA dataset, and 0.573 and 0.533 on the SEWA dataset, respectively.

### Main Contributions

1. **Multimodal Information Fusion**: A model-level fusion method based on attention mechanisms is proposed, effectively learning the complex nonlinear relationships between different modalities and promoting the dynamic interaction of emotional information.
2. **Temporal Context Dependencies**: The local contextual temporal convolutional network and global contextual temporal convolutional network are proposed, combining multi-scale contextual information along the time axis to gradually integrate multi-scale temporal contextual information.
3. **End-to-End Trainable Model Framework**: An end-to-end trainable model framework that integrates multimodal data is constructed for frame-level emotion state prediction.

### Conclusion

The multimodal continuous emotion recognition framework proposed in this paper not only addresses the challenges of multimodal data fusion and temporal context modeling but also significantly improves the performance of emotion recognition, providing new directions for research in continuous emotion recognition.
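To make the "parallel dilated convolution layers" idea concrete, here is a minimal pure-Python sketch of one feature stream passed through several dilated 1-D convolutions at once, each branch seeing a different temporal span. It is not the authors' LC-TCN: the causal (left-padded) form, the kernel values, the dilation rates, and the function names `dilated_causal_conv`, `receptive_field`, and `parallel_dilated_branches` are all assumptions for illustration, and the channel attention that the summary mentions is omitted.

```python
def dilated_causal_conv(signal, kernel, dilation):
    """1-D dilated convolution: y[t] = sum_k kernel[k] * signal[t - k * dilation].

    Assumption: causal form with implicit zero padding on the left, so the
    output has the same length as the input. The source does not say
    whether the paper's convolutions are causal.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # samples before the start count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilation):
    """Number of time steps spanned by a single dilated convolution layer."""
    return (kernel_size - 1) * dilation + 1


def parallel_dilated_branches(signal, kernel, dilations):
    """Apply the same kernel at several dilation rates (one branch per rate)."""
    return {d: dilated_causal_conv(signal, kernel, d) for d in dilations}


# A unit impulse shows which past frames each branch can reach.
impulse = [1.0] + [0.0] * 7
branches = parallel_dilated_branches(impulse, kernel=[1.0, 1.0, 1.0], dilations=(1, 2, 4))
spans = {d: receptive_field(3, d) for d in (1, 2, 4)}
```

With a kernel of size 3, the three branches span 3, 5, and 9 frames respectively, which is the sense in which parallel dilation rates give multi-scale temporal context at no extra cost in parameters per branch. How the real model merges the branches is not specified in the text above.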

A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Emotion Recognition in Videos via Fusing Multimodal Features.

An Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

Residual multimodal Transformer for expression‐EEG fusion continuous emotion recognition

Multimodal Transformer Fusion for Continuous Emotion Recognition

An Improved Multimodal Dimension Emotion Recognition Based on Different Fusion Methods

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Temporal Convolutional Network-Enhanced Real-Time Implicit Emotion Recognition with an Innovative Wearable fNIRS-EEG Dual-Modal System

A multimodal shared network with a cross-modal distribution constraint for continuous emotion recognition

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network

Multimodal emotion recognition from facial expression and speech based on feature fusion

Multi-head attention fusion networks for multi-modal speech emotion recognition

A novel feature fusion network for multimodal emotion recognition from EEG and eye movement signals

Multimodal Emotion Recognition From EEG Signals and Facial Expressions

Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition

E-MFNN: an emotion-multimodal fusion neural network framework for emotion recognition

Multi-modal fusion network with complementarity and importance for emotion recognition