Abstract:

Purpose: Although numerous signal modalities are available for emotion recognition, audio and visual modalities are the most common ways in which people express their emotional states in daily communication. Automatic and accurate audiovisual emotion recognition is therefore important for developing engaging and empathetic human–computer interaction environments. However, two major challenges remain in audiovisual emotion recognition: (1) how to effectively capture representations of each single modality while eliminating redundant features and (2) how to efficiently integrate information from the two modalities to generate discriminative representations.

Design/methodology/approach: A novel key-frame extraction-based attention fusion network (KE-AFN) is proposed for audiovisual emotion recognition. KE-AFN integrates key-frame extraction with multimodal interaction and fusion to enhance audiovisual representations and reduce redundant computation, addressing gaps in existing approaches. Specifically, a local maximum–based content analysis is designed to extract key-frames from videos in order to eliminate data redundancy. Two modules, the “Multi-head Attention-based Intra-modality Interaction Module” and the “Multi-head Attention-based Cross-modality Interaction Module”, are proposed to capture intra- and cross-modality interactions, further reducing data redundancy and producing more powerful multimodal representations.

Findings: Extensive experiments on two benchmark datasets (i.e. RAVDESS and CMU-MOSEI) demonstrate the effectiveness and rationality of KE-AFN. Specifically, (1) KE-AFN is superior to state-of-the-art baselines for audiovisual emotion recognition. (2) Exploring the supplementary and complementary information of different modalities provides more emotional clues for better emotion recognition. (3) The proposed key-frame extraction strategy improves accuracy by more than 2.79 per cent. (4) Both exploring intra- and cross-modality interactions and employing attention-based audiovisual fusion lead to better prediction performance.

Originality/value: The proposed KE-AFN can support the development of engaging and empathetic human–computer interaction environments.
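The abstract does not spell out how the local maximum–based content analysis works. The following is a minimal sketch of one common reading of that idea, not the authors' implementation: it assumes each frame is a flattened list of grayscale pixel values, measures content change as the mean absolute difference between consecutive frames, and keeps frames where that change is a strict local maximum. The function names `frame_differences` and `local_maximum_key_frames` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def frame_differences(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel difference between each pair of consecutive frames.

    Assumption: every frame is a flattened sequence of grayscale values,
    and all frames have the same length.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        total = sum(abs(a - b) for a, b in zip(prev, curr))
        diffs.append(total / len(curr))
    return diffs


def local_maximum_key_frames(frames: Sequence[Sequence[float]]) -> List[int]:
    """Indices of frames whose change from the previous frame is a strict local maximum.

    This is an illustrative stand-in for the paper's content analysis,
    whose exact criterion is not given in the abstract.
    """
    diffs = frame_differences(frames)
    key_frames = []
    for i in range(1, len(diffs) - 1):
        if diffs[i] > diffs[i - 1] and diffs[i] > diffs[i + 1]:
            # diffs[i] compares frame i with frame i + 1, so the new
            # content first appears in frame i + 1.
            key_frames.append(i + 1)
    return key_frames


# Toy video: seven two-pixel frames with content changes at frames 2 and 5.
toy_frames = [[0, 0], [0, 0], [5, 5], [5, 5], [5, 5], [9, 9], [9, 9]]
print(local_maximum_key_frames(toy_frames))  # [2, 5]
```

In this sketch the static frames between changes are discarded, which is the kind of redundancy reduction the abstract attributes to key-frame extraction. A practical pipeline would likely compare histograms or learned features rather than raw pixels.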

An Efficient Approach for Audio-Visual Emotion Recognition with Missing Labels and Missing Modalities

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information.

Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Self-attention fusion for audiovisual emotion recognition with incomplete data

A robust multimodal approach for emotion recognition

Audio-visual Based Emotion Recognition-a New Approach

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Audio-Visual Emotion Recognition Based on Facial Expression and Affective Speech

Multimodal Emotion Recognition by Combining Physiological Signals and Facial Expressions: a Preliminary Study.

Multimodal Emotional Classification Based on Meaningful Learning

Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities

Versatile audio-visual learning for emotion recognition

Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition

Robust Audiovisual Emotion Recognition: Aligning Modalities, Capturing Temporal Information, and Handling Missing Features

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Analyzing Audiovisual Data for Understanding User's Emotion in Human-Computer Interaction Environment

End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild

Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention

Multitask Learning and Multistage Fusion for Dimensional Audiovisual Emotion Recognition