Abstract: Dynamic facial expression recognition (FER) databases provide important data support for affective computing and its applications. However, most FER databases are annotated with a few basic, mutually exclusive emotion categories and contain only one modality, e.g., video. Such limited labels and a single modality cannot accurately reflect human emotions or meet the needs of real-world applications. In this paper, we propose MAFW, a large-scale multi-modal compound affective database with 10,045 video-audio clips in the wild. Each clip is annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of 11 widely used emotions, i.e., anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment. To ensure high-quality labels, we filter out unreliable annotations with an Expectation Maximization (EM) algorithm, and then obtain 11 single-label emotion categories and 32 multi-label emotion categories. To the best of our knowledge, MAFW is the first in-the-wild multi-modal database with compound emotion annotations and emotion-related captions. We also propose a novel Transformer-based expression snippet feature learning method that recognizes compound emotions by leveraging the expression-change relations among different emotions and modalities. Extensive experiments on the MAFW database show the advantages of the proposed method over other state-of-the-art methods for both uni- and multi-modal FER. Our MAFW database is publicly available at https://mafw-database.github.io/MAFW

MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation

EmoFace: Audio-driven Emotional 3D Face Animation

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Talking Face Generation With Audio-Deduced Emotional Landmarks

4DME: A Spontaneous 4D Micro-Expression Dataset with Multimodalities

BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Audio-Driven Emotional Video Portraits

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model

HEU Emotion: A Large-scale Database for Multi-modal Emotion Recognition in the Wild

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

Building a Chinese Natural Emotional Audio-Visual Database

Emotional Audio-Visual Speech Synthesis Based on PAD

MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild

ECAvatar: 3D Avatar Facial Animation with Controllable Identity and Emotion

M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database