Abstract

Purpose: Although numerous signal modalities are available for emotion recognition, audio and visual signals are the predominant means by which people express their emotional states in daily communication. Automatic and accurate audiovisual emotion recognition is therefore important for developing engaging and empathetic human–computer interaction environments. Two major challenges remain in this field: (1) how to effectively capture representations of each modality while eliminating redundant features, and (2) how to efficiently integrate information from the two modalities to generate discriminative representations.

Design/methodology/approach: A novel key-frame extraction-based attention fusion network (KE-AFN) is proposed for audiovisual emotion recognition. KE-AFN aims to integrate key-frame extraction with multimodal interaction and fusion in order to enhance audiovisual representations and reduce redundant computation, addressing gaps in existing approaches. Specifically, a local maximum–based content analysis is designed to extract key frames from videos and thereby eliminate data redundancy. Two modules, the “Multi-head Attention-based Intra-modality Interaction Module” and the “Multi-head Attention-based Cross-modality Interaction Module”, are proposed to capture intra- and cross-modality interactions, further reducing data redundancy and producing more powerful multimodal representations.

Findings: Extensive experiments on two benchmark datasets (RAVDESS and CMU-MOSEI) demonstrate the effectiveness and rationality of KE-AFN. Specifically, (1) KE-AFN outperforms state-of-the-art baselines for audiovisual emotion recognition. (2) Exploring the supplementary and complementary information of different modalities provides more emotional cues for better emotion recognition. (3) The proposed key-frame extraction strategy improves accuracy by more than 2.79 per cent. (4) Exploring intra- and cross-modality interactions and employing attention-based audiovisual fusion both lead to better prediction performance.

Originality/value: The proposed KE-AFN can support the development of engaging and empathetic human–computer interaction environments.
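
The abstract names a local maximum–based content analysis for key-frame extraction but does not give its details. As a rough illustration of the general idea only, the sketch below scores each frame by how much it differs from the previous one and keeps the frames where that score peaks. The choice of mean absolute pixel difference as the score, the flattened grayscale frame representation, and all function names are assumptions made for this example; they are not taken from KE-AFN.

```python
from typing import List, Sequence

# Hypothetical representation: one flattened grayscale frame per entry.
Frame = Sequence[float]


def frame_differences(frames: List[Frame]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor."""
    diffs = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return diffs


def local_maxima(scores: List[float]) -> List[int]:
    """Indices whose score is strictly greater than both neighbours."""
    return [
        i
        for i in range(1, len(scores) - 1)
        if scores[i] > scores[i - 1] and scores[i] > scores[i + 1]
    ]


def extract_key_frames(frames: List[Frame]) -> List[int]:
    """Indices of frames at local maxima of the content-change score."""
    return local_maxima(frame_differences(frames))


if __name__ == "__main__":
    # Toy clip: content changes sharply at frames 2 and 5.
    clip = [[0, 0], [0, 0], [5, 5], [5, 5], [5, 5], [9, 9], [9, 8]]
    print(extract_key_frames(clip))
```

In this toy clip the change scores are 0, 0, 5, 0, 0, 4 and 0.5, so frames 2 and 5 are selected and the near-duplicate frames between them are dropped. A practical pipeline would likely smooth the score and use a richer content measure before peak picking.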
