Abstract: Automatic emotion recognition (ER) has recently gained a lot of interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance over unimodal approaches by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual’s emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities, effectively leveraging the inter-modal relationships while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities. Deploying the joint A-V feature representation in the cross-attention module helps to simultaneously leverage both intra- and inter-modal relationships, thereby significantly improving the performance of the system over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at https://github.com/praveena2j/Joint-Cross-Attention-for-Audio-Visual-Fusion.
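
The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything below is an assumption for illustration: the function and parameter names (`joint_cross_attention`, `init_params`, `w_joint`, `w_self`, `w_cross`, `w_out`), the tensor shapes, the tanh and ReLU nonlinearities, and the residual connection. It is not the released implementation. The sketch concatenates audio and visual features into a joint representation, correlates each modality with that joint representation, and uses the result to re-weight the modality's own features.

```python
import numpy as np


def init_params(seq_len, d_a, d_v, k, rng):
    """Random weights for both modalities (hypothetical names and shapes)."""
    params = {}
    for name, d_m in (("audio", d_a), ("visual", d_v)):
        params[name] = {
            "w_joint": 0.1 * rng.standard_normal((seq_len, seq_len)),
            "w_self": 0.1 * rng.standard_normal((d_m, k)),
            "w_cross": 0.1 * rng.standard_normal((d_m, k)),
            "w_out": 0.1 * rng.standard_normal((k, d_m)),
        }
    return params


def joint_cross_attention(x_a, x_v, params):
    """Fuse audio features x_a (L, d_a) and visual features x_v (L, d_v).

    Returns the concatenated attended features with shape (L, d_a + d_v).
    """
    d = x_a.shape[1] + x_v.shape[1]
    # Joint A-V representation: per-time-step concatenation, shape (L, d).
    joint = np.concatenate([x_a, x_v], axis=1)

    attended = []
    for x, p in ((x_a, params["audio"]), (x_v, params["visual"])):
        # Correlation between this modality and the joint representation,
        # shape (d_m, d). Because the joint features contain the modality
        # itself, this mixes intra- and inter-modal information.
        corr = np.tanh(x.T @ p["w_joint"] @ joint / np.sqrt(d))
        # Joint features projected onto the modality's dimensions, (L, d_m).
        cross = joint @ corr.T
        # Combine the modality's own features with the cross term, (L, k).
        hidden = np.maximum(0.0, x @ p["w_self"] + cross @ p["w_cross"])
        # Residual connection keeps the original features, (L, d_m).
        attended.append(x + hidden @ p["w_out"])
    return np.concatenate(attended, axis=1)


rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 4))    # 8 time steps, 4-dim audio features
visual = rng.standard_normal((8, 6))   # 8 time steps, 6-dim visual features
weights = init_params(seq_len=8, d_a=4, d_v=6, k=5, rng=rng)
fused = joint_cross_attention(audio, visual, weights)
print(fused.shape)
```

In a vanilla cross-attention variant, the correlation would be computed between the audio and visual features directly. Here the second operand is the joint representation, which is one way to realize the abstract's point that intra- and inter-modal relationships are leveraged together.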
