Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition

Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S.R Mahadeva Prasanna
2024-09-22
Abstract: In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous in audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. We also investigate the potential of combining selected foundation model representations to enhance NVER further, inspired by research in speech recognition and audio deepfake detection. To achieve this, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). Through MATA coupled with the combination of two MFMs, LanguageBind and ImageBind, we report the topmost performance against individual FMs and baseline fusion techniques, with accuracies of 76.47%, 77.40%, and 75.12% and F1-scores of 70.35%, 76.19%, and 74.63% on the ASVP-ESD, JNV, and VIVAE datasets, respectively, and report SOTA on the benchmark datasets.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses non-verbal emotion recognition (NVER), the task of recognizing emotion from non-verbal sounds. The authors ask whether multimodal foundation models (MFMs) can improve on audio-only foundation models (AFMs) for this task. Their hypothesis is that joint pre-training across multiple modalities lets MFMs interpret and distinguish subtle emotional cues that remain ambiguous to models trained on audio alone. To test it, they extract representations from state-of-the-art MFMs and AFMs and evaluate them on benchmark NVER datasets. They also explore combining representations from different foundation models, and for this purpose propose MATA (Intra-Modality Alignment through Transport Attention), a framework that fuses the representations using optimal transport. MATA combined with two MFMs, LanguageBind and ImageBind, achieves the best reported results: accuracies of 76.47%, 77.40%, and 75.12% and F1-scores of 70.35%, 76.19%, and 74.63% on ASVP-ESD, JNV, and VIVAE, respectively. These results exceed those of the individual foundation models and the baseline fusion techniques, and the authors report them as state of the art (SOTA) on these benchmarks.
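The text above does not spell out how MATA applies optimal transport, so the following is only a minimal sketch of the general idea of transport-based fusion, not the authors' implementation. It assumes two feature matrices for the same clip (one per foundation model) that have already been projected to a shared dimension, uses entropic-regularized optimal transport solved by Sinkhorn iterations with uniform marginals, and fuses by concatenation. The function names `sinkhorn_plan` and `transport_fuse`, the cosine cost, and the `reg` and `n_iters` values are all illustrative assumptions.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic-regularized OT plan between two uniform distributions.

    cost: (n, m) pairwise cost matrix. Returns an (n, m) transport plan.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform source marginal (assumption)
    b = np.full(m, 1.0 / m)   # uniform target marginal (assumption)
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternate scaling so rows match a, then columns match b.
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def transport_fuse(x, y, reg=0.1):
    """Align the rows of y to the rows of x with an OT plan, then concatenate.

    x: (n, d) features from one model; y: (m, d) features from another.
    Returns the fused (n, 2 * d) matrix and the (n, m) transport plan.
    """
    # Cosine distance between L2-normalized rows, in [0, 2].
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - xn @ yn.T
    plan = sinkhorn_plan(cost, reg)
    # Barycentric projection: each row of x receives a plan-weighted
    # average of the rows of y.
    weights = plan / plan.sum(axis=1, keepdims=True)
    y_aligned = weights @ y
    return np.concatenate([x, y_aligned], axis=1), plan


# Random stand-ins for the two models' features (hypothetical shapes).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
y = rng.normal(size=(7, 8))
fused, plan = transport_fuse(x, y)
print(fused.shape, plan.shape)
```

In this sketch the transport plan acts like a soft attention map between the two feature sets, and the fused matrix could then feed a small classifier. A smaller `reg` gives a sharper plan but may need more iterations or a log-domain solver for numerical stability.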