Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition

Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S.R Mahadeva Prasanna

2024-09-22

Abstract: In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous in audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. Inspired by research in speech recognition and audio deepfake detection, we also investigate the potential of combining selected foundation model representations to enhance NVER further. To achieve this, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). Through MATA coupled with the combination of two MFMs, LanguageBind and ImageBind, we report the best performance against individual FMs and baseline fusion techniques, with accuracies of 76.47%, 77.40%, 75.12% and F1-scores of 70.35%, 76.19%, 74.63% on the ASVP-ESD, JNV, and VIVAE datasets, respectively, setting SOTA on these benchmark datasets.

Audio and Speech Processing, Sound

What problem does this paper attempt to address?

This paper addresses non-verbal emotion recognition (NVER), that is, recognizing emotion from non-verbal sounds. Specifically, the authors ask whether multimodal foundation models (MFMs) can improve NVER performance. They hypothesize that, compared with audio-only foundation models (AFMs), MFMs can better interpret and distinguish the subtle emotional cues in non-verbal sounds because they are jointly pre-trained across multiple modalities. These cues may be ambiguous for single-modality models. To test this hypothesis, the researchers extract representations from state-of-the-art MFMs and AFMs and evaluate them on benchmark NVER datasets. They also explore whether combining the representations of different foundation models improves NVER further. For this purpose, they propose a framework named MATA (Intra-Modality Alignment through Transport Attention), which fuses the representations of different foundation models using optimal transport. With MATA and the combination of two MFMs, LanguageBind and ImageBind, they report accuracies of 76.47%, 77.40%, and 75.12% and F1-scores of 70.35%, 76.19%, and 74.63% on the ASVP-ESD, JNV, and VIVAE datasets, respectively. These results outperform individual foundation models and baseline fusion techniques and are reported as state of the art (SOTA) on these benchmarks. In short, the work aims to improve NVER by using MFMs together with the proposed fusion framework MATA.
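
To make the idea of fusing two models' representations with optimal transport more concrete, here is a minimal sketch in Python. It is not the authors' MATA implementation, whose architecture is not described on this page. It only illustrates a generic pattern under stated assumptions: an entropy-regularized (Sinkhorn) transport plan is computed between two sets of feature vectors, one set is aligned to the other by barycentric projection, and the result is concatenated. The function names `sinkhorn_plan` and `transport_fuse`, the cosine cost, the uniform marginals, and the regularization value are all hypothetical choices made for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform distributions.

    cost: (n, m) pairwise cost matrix. Returns an (n, m) transport plan.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform source marginal (assumption)
    b = np.full(m, 1.0 / m)      # uniform target marginal (assumption)
    K = np.exp(-cost / reg)      # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):     # alternating Sinkhorn scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def transport_fuse(x, y, reg=0.1):
    """Align y to x with an OT plan, then concatenate.

    x: (n, d) features from one model; y: (m, d) features from another.
    Returns an (n, 2 * d) fused matrix.
    """
    # Cosine distance as the transport cost (illustrative choice).
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - xn @ yn.T
    plan = sinkhorn_plan(cost, reg)
    # Barycentric projection: each row of x gets a plan-weighted mean of y rows.
    y_aligned = (plan / plan.sum(axis=1, keepdims=True)) @ y
    return np.concatenate([x, y_aligned], axis=1)

# Random stand-ins for representations from two foundation models.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(8, 16))
feats_b = rng.normal(size=(8, 16))
fused = transport_fuse(feats_a, feats_b)
print(fused.shape)
```

In a pipeline of this kind, the fused matrix would then feed a downstream emotion classifier. How MATA combines the transported features with attention is not specified here, so that step is left out of the sketch.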

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models

Multimodal Emotional Classification Based on Meaningful Learning

Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities

Multimodal emotion recognition model via hybrid model with improved feature level fusion on facial and EEG feature set

A customizable framework for multimodal emotion recognition using ensemble of deep neural network models

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Multimodal modelling of human emotion using sound, image and text fusion

Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

Multimodal Emotion Recognition Using Different Fusion Techniques

Robust Audiovisual Emotion Recognition: Aligning Modalities, Capturing Temporal Information, and Handling Missing Features

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Attention-based multimodal sentiment analysis and emotion recognition using deep neural networks

Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

Enhancing Emotion Recognition through Multimodal Systems and Advanced Deep Learning Techniques