Abstract: Multi-modal emotion analysis, an important direction in affective computing, has attracted increasing attention in recent years. Most existing multi-modal emotion recognition studies address a classification task that assigns a single emotion category to a combination of heterogeneous inputs, including multimedia signals and physiological signals. In contrast to this single-class view, a growing body of psychological evidence suggests that different discrete emotions may co-exist, which has motivated mixed-emotion recognition, the task of identifying a mixture of basic emotions. Although most current studies treat this as a multi-label classification task, in this work we focus on the challenging situation where positive and negative emotions are present simultaneously, and propose a multi-modal mixed emotion recognition framework named EmotionDict. The key characteristics of EmotionDict are as follows. (1) Inspired by the psychological evidence that such a mixed state can be represented by combinations of basic emotions, we address mixed emotion recognition as a label distribution learning task. An emotion dictionary is designed to disentangle mixed emotion representations into a set of basic emotion elements in a shared latent space and their corresponding weights, so that each representation is expressed as a weighted sum of those elements. (2) While many existing emotion distribution studies are built on a single type of multimedia signal (such as text, image, audio, or video), we incorporate physiological and overt behavioral multi-modal signals, including electroencephalogram (EEG), peripheral physiological signals, and facial videos, which directly display subjective emotions. These modalities have diverse characteristics because they are related to the central or peripheral nervous system and the motor cortex. (3) We further design auxiliary tasks to learn modality attentions for modality integration. Experiments on two datasets show that our method outperforms existing state-of-the-art approaches on mixed-emotion recognition.
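
The ideas named in the abstract (attention-weighted modality integration, a dictionary of basic emotion elements, and a label distribution objective) can be illustrated with a minimal sketch. This is not the EmotionDict architecture: the dot-product scoring, the softmax weighting, the KL-divergence loss, the dimensions, and all function names are assumptions chosen for illustration, and random vectors stand in for learned EEG, peripheral, and facial features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] for equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def fuse_modalities(features, attention_logits):
    """Fuse per-modality latent vectors with softmax attention weights.

    Assumption: attention logits are given; in a real model they would be
    learned (the abstract mentions auxiliary tasks for this purpose).
    """
    attention = softmax(attention_logits)
    return weighted_sum(attention, features), attention


def dictionary_decompose(z, dictionary):
    """Express z as a weighted sum of dictionary elements.

    Assumption: weights come from a softmax over dot-product similarities,
    so they also form a distribution over the basic emotions.
    """
    scores = [sum(a * b for a, b in zip(atom, z)) for atom in dictionary]
    distribution = softmax(scores)
    reconstruction = weighted_sum(distribution, dictionary)
    return distribution, reconstruction


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), a common label distribution learning loss (assumed here)."""
    return sum(pi * math.log(max(pi, eps) / max(qi, eps)) for pi, qi in zip(p, q))


rng = random.Random(0)
latent_dim, num_modalities, num_basic_emotions = 8, 3, 4

# Hypothetical latent features for EEG, peripheral signals, and facial video.
features = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(num_modalities)]
attention_logits = [0.5, 0.1, -0.2]

# Hypothetical dictionary: one latent vector per basic emotion.
dictionary = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(num_basic_emotions)]

fused, att = fuse_modalities(features, attention_logits)
dist, recon = dictionary_decompose(fused, dictionary)

# Hypothetical annotated mixed-emotion distribution for one sample.
target = [0.4, 0.1, 0.4, 0.1]
loss = kl_divergence(target, dist)
print("modality attention:", att)
print("predicted emotion distribution:", dist)
print("KL loss:", loss)
```

In this sketch the predicted weights double as the emotion distribution, which is one simple way to connect a dictionary decomposition to a label distribution target; other parameterizations are equally plausible.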

A Versatile Multimodal Learning Framework For Zero-shot Emotion Recognition

An Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

First-order Multi-label Learning with Cross-modal Interactions for Multimodal Emotion Recognition

Fine-grained Disentangled Representation Learning for Multimodal Emotion Recognition

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

Multimodal emotion recognition based on audio and text by using hybrid attention networks

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Leveraging Label Information for Multimodal Emotion Recognition

Zero-Shot Emotion Recognition Via Affective Structural Embedding

A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face

Multiplex graph aggregation and feature refinement for unsupervised incomplete multimodal emotion recognition

Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Incomplete Data Scenarios

Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations

Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

Emotion Dictionary Learning with Modality Attentions for Mixed Emotion Exploration

Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition

Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion

Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment

Multimodal interaction enhanced representation learning for video emotion recognition

Decoupled Multimodal Distilling for Emotion Recognition