Abstract: On today's social networks, more and more people express their emotions in videos through text, speech, and rich facial expressions. Multimodal video emotion analysis can help understand users' inner states automatically, based on facial expressions and gestures in images, tone of voice, and recognized natural language. In existing research, however, the acoustic modality has long played a marginal role compared with the visual and textual modalities; that is, its contribution to the overall multimodal emotion recognition task has proven harder to improve. In addition, although common deep learning methods yield better performance, the complex structures of these models result in low inference efficiency, especially on high-resolution and long videos. Moreover, the lack of a fully end-to-end multimodal video emotion recognition system hinders practical application. In this paper, we design a fully multimodal video-to-emotion system (named FV2ES) for fast yet effective recognition inference, whose benefits are threefold: (1) applying a hierarchical attention method to the sound spectra overcomes the limited contribution of the acoustic modality and outperforms existing models on both the IEMOCAP and CMU-MOSEI datasets; (2) using a multi-scale design for visual feature extraction and a single-branch design for inference brings higher efficiency while maintaining prediction accuracy; (3) integrating data pre-processing into the aligned multimodal learning model significantly reduces computational cost and storage space.
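Benefit (2) can be illustrated with a small sketch. The abstract does not say how the multi-scale branches are reduced to a single branch, so the code below assumes one common approach: because convolution is linear, parallel 3x3, 1x1, and identity branches can be folded into one 3x3 kernel before inference. All function names are hypothetical, the example is single-channel, and normalization layers are omitted; it is not the FV2ES implementation.

```python
# Illustrative only: assumes the branches are plain linear convolutions.
# Function names are hypothetical and not taken from FV2ES.

def conv2d_same(image, kernel):
    """Zero-padded 'same' cross-correlation with an odd-sized square kernel."""
    k = len(kernel)
    r = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def pad_to_3x3(kernel_1x1):
    """Embed a 1x1 kernel at the centre of a zero 3x3 kernel."""
    w = kernel_1x1[0][0]
    return [[0.0, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, 0.0]]


def merge_branches(k3, k1):
    """Fold a 3x3 branch, a 1x1 branch, and an identity shortcut into one 3x3 kernel."""
    k1p = pad_to_3x3(k1)
    identity = pad_to_3x3([[1.0]])
    return [[k3[i][j] + k1p[i][j] + identity[i][j] for j in range(3)]
            for i in range(3)]


def multi_branch(image, k3, k1):
    """Multi-branch form: three parallel paths, summed."""
    a = conv2d_same(image, k3)
    b = conv2d_same(image, k1)
    return [[a[y][x] + b[y][x] + image[y][x] for x in range(len(image[0]))]
            for y in range(len(image))]


def single_branch(image, k3, k1):
    """Inference-time form: one convolution with the merged kernel."""
    return conv2d_same(image, merge_branches(k3, k1))
```

Under these assumptions both forms produce the same output, but the single-branch form runs one convolution instead of three paths, which is the kind of saving the abstract attributes to single-branch inference.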

Mood as a Contextual Cue for Improved Emotion Inference

A Weakly Supervised Approach to Emotion-change Prediction and Improved Mood Inference

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

To Improve Is to Change: Towards Improving Mood Prediction by Learning Changes in Emotion

Focus on Change: Mood Prediction by Learning Emotion Changes via Spatio-Temporal Attention

Bridging the Emotional Semantic Gap via Multimodal Relevance Estimation

Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Multimodal Dimensional and Continuous Emotion Recognition in Dyadic Video Interactions

Multi-Modal Continuous Valence And Arousal Prediction in the Wild Using Deep 3D Features and Sequence Modeling

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network

Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction

MoodCam: Mood Prediction Through Smartphone-Based Facial Affect Analysis in Real-World Settings

Self-adaptive Context and Modal-interaction Modeling For Multimodal Emotion Recognition

Going Beyond Closed Sets: A Multimodal Perspective for Video Emotion Analysis

VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning

Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction

Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition

Context-aware Multimodal Fusion for Emotion Recognition

Bridging Discrete and Continuous: A Multimodal Strategy for Complex Emotion Detection