Abstract: Emotion detection is an important step toward richer human-computer interaction, in which systems can identify and respond to the emotional state of their users. Multimodal emotion detection systems that fuse auditory and visual information are emerging, because the two modalities give complementary and robust cues to expressed emotional states. Such systems improve the quality of interaction and support applications such as diagnosing emotional disorders, monitoring automotive safety, and improving human-robot interaction. Traditional models based on single-modality data tend to suffer from low accuracy and high computational cost, because emotional data are high-dimensional and change dynamically. Multimodal systems, by contrast, exploit the synergy between audio and visual data, giving higher accuracy in inferring subtle emotional expressions. These systems have most recently been improved through advances in transfer learning and deep learning.
This research proposal devises a multimodal emotion recognition system that integrates speech and facial information through transfer learning for improved accuracy and robustness. To this end, the research will compare different transfer-learning strategies, including the impact of pre-trained models on speech-based emotion recognition, and will examine the role of voice activity detection in the process. Advanced neural network architectures such as Spatial Transformer Networks and bidirectional LSTMs will be tested for facial emotion recognition. Early and late fusion strategies will be compared to find the best way of combining speech and facial data.
The research targets several challenges: the complexity of the data, the trade-off between model performance and robustness, computational limitations, and the standardization of evaluation. The goal is a working, robust emotion recognition system that overcomes the limitations of single-modality models by drawing on state-of-the-art deep learning and recent improvements in transfer learning, in order to enhance digital interaction and to be applicable in practical settings.
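To make the distinction between early and late fusion concrete, the following is a minimal sketch in plain Python and not the proposed system. It assumes a hypothetical four-class label set, toy linear classifiers in place of the deep networks, and hand-picked numbers. All names (`EMOTIONS`, `early_fusion`, `late_fusion`, `speech_weight`) are illustrative assumptions. Early fusion concatenates the speech and face features before a single classifier; late fusion averages the class probabilities produced by two separate classifiers.

```python
import math

# Hypothetical label set, chosen only for illustration.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_scores(features, weights):
    """One score per class: dot product of the features with each weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def early_fusion(speech_feats, face_feats, joint_weights):
    """Feature-level fusion: concatenate both feature vectors, classify once."""
    joint = list(speech_feats) + list(face_feats)
    return softmax(linear_scores(joint, joint_weights))


def late_fusion(speech_probs, face_probs, speech_weight=0.5):
    """Decision-level fusion: weighted average of per-modality probabilities."""
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must be in [0, 1]")
    return [
        speech_weight * s + (1.0 - speech_weight) * f
        for s, f in zip(speech_probs, face_probs)
    ]


def predict(probs):
    """Return the label with the highest fused probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best]


if __name__ == "__main__":
    # Toy outputs of two separate (assumed) unimodal classifiers.
    speech_probs = [0.1, 0.6, 0.2, 0.1]
    face_probs = [0.2, 0.3, 0.4, 0.1]
    print("late fusion :", predict(late_fusion(speech_probs, face_probs)))

    # Toy features and a toy joint weight matrix (4 classes x 3 features).
    joint_weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
    print("early fusion:", predict(early_fusion([1.0, 0.0], [0.5], joint_weights)))
```

In this sketch the `speech_weight` parameter controls how much the speech classifier is trusted relative to the face classifier. In a real system it would be tuned on validation data or replaced by a learned fusion layer.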

SeLF: A Deep Neural Network Based Multimodal Sequential Late Fusion Approach for Human Emotion Recognition

Emotion Recognition in Videos via Fusing Multimodal Features

Automatic Emotion Recognition Using Temporal Multimodal Deep Learning

Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models

Multimodal Emotion Recognition Using Different Fusion Techniques

Enhancing Emotion Recognition through Multimodal Systems and Advanced Deep Learning Techniques

Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities

FusionSense: Emotion Classification Using Feature Fusion of Multimodal Data and Deep Learning in a Brain-Inspired Spiking Neural Network

Real-time emotional health detection using fine-tuned transfer networks with multimodal fusion

Deep learning-based late fusion of multimodal information for emotion classification of music video

Multimodal modelling of human emotion using sound, image and text fusion

Facial Expression Recognition Using Visible, IR, and MSX Images by Early and Late Fusion of Deep Learning Models

Multimodal emotion recognition model via hybrid model with improved feature level fusion on facial and EEG feature set

Enhanced multimodal emotion recognition in healthcare analytics: A deep learning based model-level fusion approach

Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models

Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition

A VGG16 Based Hybrid Deep Convolutional Neural Network Based Real-Time Video Frame Emotion Detection System for Affective Human Computer Interaction

Emotion Recognition System from Speech and Visual Information based on Convolutional Neural Networks

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Multimodal Emotion Recognition Framework Using a Decision-Level Fusion and Feature-Level Fusion Approach