Abstract: Emotional video captioning (EVC) is an emerging task that aims to describe the factual content of a video together with the inherent emotion it expresses. It is crucial for the EVC task to effectively perceive subtle and ambiguous visual emotion cues during caption generation. However, existing captioning methods usually overlook the learning of emotions in user-generated videos, which makes the generated sentences somewhat boring and soulless. To address this issue, this paper proposes a new emotional captioning perspective that follows a human-like, perception-priority manner, i.e., first perceiving the inherent emotion and then leveraging the perceived emotion cue to support caption generation. Specifically, we devise an Emotion-Prior Awareness Network (EPAN). It mainly benefits from a novel tree-structured emotion learning module that involves both catalog-level psychological categories and lexical-level common words to achieve explicit and fine-grained emotion perception. In addition, we develop a novel subordinate emotion masking mechanism between the catalog level and the lexical level that facilitates coarse-to-fine emotion learning. Afterward, with the emotion prior, we can effectively decode the emotional caption by exploiting the complementarity of visual, textual, and emotional semantics. We also introduce three simple yet effective optimization objectives, which significantly boost emotion learning from the perspectives of emotional captioning, hierarchical emotion classification, and emotional contrastive learning. Extensive experimental results on three benchmark datasets demonstrate the advantages of the proposed EPAN over existing SOTA methods on both semantic and emotional metrics. An ablation study and visualization analysis further reveal the good interpretability of our emotional video captioning method. Code will be made available at https://github.com/songpipi/EPAN.
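
The coarse-to-fine idea behind subordinate emotion masking can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from EPAN itself: the two-category emotion tree, the word lists, and the names `EMOTION_TREE`, `softmax`, and `masked_word_probs` are hypothetical, and the abstract does not specify how the masking is actually implemented. The sketch picks the highest-scoring category and then restricts the word-level distribution to the words that fall under that category.

```python
import math

# Hypothetical two-level emotion tree (assumption, not from the paper):
# each psychological category owns a few subordinate emotion words.
EMOTION_TREE = {
    "joy": ["happy", "cheerful"],
    "fear": ["scared", "nervous"],
}
# Flat word vocabulary, in the same order as the word-level scores.
WORDS = [w for words in EMOTION_TREE.values() for w in words]


def softmax(scores):
    """Softmax that assigns probability 0 to entries masked with -inf."""
    peak = max(s for s in scores if s != -math.inf)
    exps = [0.0 if s == -math.inf else math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_word_probs(category_logits, word_logits):
    """Pick the top category, then mask words outside that category."""
    categories = list(EMOTION_TREE)
    top = max(range(len(categories)), key=lambda i: category_logits[i])
    best = categories[top]
    allowed = set(EMOTION_TREE[best])
    # Words that are not subordinate to the chosen category get -inf,
    # so they receive zero probability after the softmax.
    masked = [s if w in allowed else -math.inf
              for w, s in zip(WORDS, word_logits)]
    return best, dict(zip(WORDS, softmax(masked)))


if __name__ == "__main__":
    category, probs = masked_word_probs([0.2, 1.5], [3.0, 1.0, 0.5, 0.5])
    print(category, probs)
```

In this toy setting, a word with a high raw score is still suppressed when it belongs to a category other than the predicted one, which is one plausible reading of how a category-level decision could constrain lexical-level emotion prediction.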

Emotion-Prior Awareness Network for Emotional Video Captioning