Abstract: Multimodal emotion recognition has the potential to impact various fields, including human-computer interaction, virtual reality, and emotional intelligence systems. This study introduces a comprehensive framework that improves the accuracy and computational efficiency of emotion recognition by leveraging knowledge distillation and transfer learning, incorporating both unimodal and multimodal models. The framework also combines subject-specific and subject-independent models, balancing localization and generalization. The subject-independent models are EEG-based, non-EEG-based (i.e., electromyography, electrooculography, electrodermal activity, galvanic skin response, skin temperature, respiration, blood volume pulse, heart rate, and eye movements), and multimodal models trained on all training subjects, capturing a broader context. The subject-specific models, likewise EEG-based, non-EEG-based, and multimodal, are trained on individual subjects to provide localized knowledge. The proposed framework then distills knowledge from these teacher models into a student model, using six different distillation losses to combine subject-independent and subject-specific insights. This approach makes the model subject-aware by using local patterns and modality-aware by incorporating unimodal data, enhancing the robustness and generalizability of emotion recognition systems in varied real-world scenarios. The framework was tested on two well-known datasets, SEED-V and DEAP, as well as an immersive three-dimensional (3D) virtual reality (VR) dataset, GraffitiVR, which captures emotional and behavioral responses from individuals experiencing urban graffiti in a VR environment. This broader application provides insights into the effectiveness of emotion recognition models in both 2D and 3D settings, facilitating a wider range of assessment.
Empirical results demonstrate that the proposed knowledge distillation-based model significantly improves performance on all three datasets compared with traditional models. Specifically, the model achieved improvements ranging from 6.56% to 24.59% over unimodal models and from 1.56% to 4.11% over multimodal approaches across the SEED-V, DEAP, and GraffitiVR datasets. These results underscore the robustness and effectiveness of the proposed approach, suggesting that it significantly enhances emotion recognition across various environmental settings.
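
The distillation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the six distillation losses, so the sketch assumes one temperature-softened KL-divergence term per teacher (three subject-independent and three subject-specific), combined with a standard cross-entropy term on the true label. The function names, the teacher names, the uniform teacher weights, and the `temperature` and `alpha` values are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_teacher_distillation_loss(student_logits, teacher_logits, label,
                                    temperature=2.0, alpha=0.5,
                                    teacher_weights=None):
    """Cross-entropy on the label plus a weighted sum of per-teacher KL terms.

    teacher_logits maps a (hypothetical) teacher name to its logits.
    Assumption: one KL term per teacher; the source only states that six
    distillation losses are used, not how they are defined or weighted.
    """
    names = sorted(teacher_logits)
    if teacher_weights is None:
        # Assumption: every teacher contributes equally.
        teacher_weights = {name: 1.0 / len(names) for name in names}

    student_soft = softmax(student_logits, temperature)
    distill = 0.0
    for name in names:
        teacher_soft = softmax(teacher_logits[name], temperature)
        distill += teacher_weights[name] * kl_divergence(teacher_soft, student_soft)
    # Conventional T^2 scaling keeps gradient magnitudes comparable across T.
    distill *= temperature ** 2

    hard = -math.log(softmax(student_logits)[label] + 1e-12)
    return alpha * hard + (1.0 - alpha) * distill


# Illustrative logits for a three-class problem (made-up numbers).
student = [1.2, 0.3, -0.5]
teachers = {
    "independent_eeg": [2.0, 0.1, -1.0],
    "independent_non_eeg": [1.5, 0.4, -0.8],
    "independent_multimodal": [2.2, 0.0, -1.1],
    "specific_eeg": [1.8, 0.2, -0.9],
    "specific_non_eeg": [1.1, 0.6, -0.4],
    "specific_multimodal": [2.4, -0.1, -1.3],
}
loss = multi_teacher_distillation_loss(student, teachers, label=0)
```

In this sketch, `alpha` trades off fitting the ground-truth label against matching the teachers, and replacing the uniform `teacher_weights` with learned or per-subject weights would be one way to shift emphasis between subject-independent and subject-specific knowledge.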

Bridging Modalities: Knowledge Distillation and Masked Training for Translating Multi-Modal Emotion Recognition to Uni-Modal, Speech-Only Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Cross-Modal Knowledge Transfer via Inter-Modal Translation and Alignment for Affect Recognition

Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models

Multimodal Emotional Classification Based on Meaningful Learning

Enhancing Emotion Recognition through Multimodal Systems and Advanced Deep Learning Techniques

Modality- and Subject-Aware Emotion Recognition Using Knowledge Distillation

Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport

Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models

Multi-Modal Emotion Detection with Transfer Learning

Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks

CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation

Decoupled Multimodal Distilling for Emotion Recognition

Joint Multimodal Transformer for Emotion Recognition in the Wild

Cross-Modal Dynamic Transfer Learning for Multimodal Emotion Recognition

LLM-Enhanced Multi-Teacher Knowledge Distillation for Modality-Incomplete Emotion Recognition in Daily Healthcare

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

Multimodal interaction enhanced representation learning for video emotion recognition

Multimodal transformer augmented fusion for speech emotion recognition

A Novel Dual-Modal Emotion Recognition Algorithm with Fusing Hybrid Features of Audio Signal and Speech Context

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation