Abstract:Emotional Video Captioning is an emerging task that aims to describe factual content with the intrinsic emotions expressed in videos. The essential of the EVC task is to effectively perceive subtle and ambiguous visual emotional cues during the caption generation, which is neglected by the traditional video captioning. Existing emotional video captioning methods perceive global visual emotional cues at first, and then combine them with the video features to guide the emotional caption generation, which neglects two characteristics of the EVC task. Firstly, their methods neglect the dynamic subtle changes in the intrinsic emotions of the video, which makes it difficult to meet the needs of common scenes with diverse and changeable emotions. Secondly, as their methods incorporate emotional cues into each step, the guidance role of emotion is overemphasized, which makes factual content more or less ignored during generation. To this end, we propose a dual-path collaborative generation network, which dynamically perceives visual emotional cues evolutions while generating emotional captions by collaborative learning. Specifically, in the dynamic emotion perception path, we propose a dynamic emotion evolution module, which first aggregates visual features and historical caption features to summarize the global visual emotional cues, and then dynamically selects emotional cues required to be re-composed at each stage. Besides, in the adaptive caption generation path, to balance the description of factual content and emotional cues, we propose an emotion adaptive decoder. Thus, our methods can generate emotion-related words at the necessary time step, and our caption generation balances the guidance of factual content and emotional cues well. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module.

Emotion Aware Reinforcement Network for Visual Storytelling

TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling

Emotion Reinforced Visual Storytelling.

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

AOG-LSTM: An Adaptive Attention Neural Network for Visual Storytelling

Emotional Video Captioning With Vision-Based Emotion Interpretation Network

SentiStory: A Multi-Layered Sentiment-Aware Generative Model for Visual Storytelling

Stimuli-Aware Visual Emotion Analysis

Knowledge-enriched Attention Network with Group-wise Semantic for Visual Storytelling

What Makes A Good Story? Designing Composite Rewards for Visual Storytelling

Progressive Visual Content Understanding Network for Image Emotion Classification

EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning

An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos

Contextualize, Show and Tell: A Neural Visual Storyteller

Dual-path Collaborative Generation Network for Emotional Video Captioning

Informative Visual Storytelling with Cross-modal Rules

Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation

Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling

ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer

Coherent Visual Storytelling Via Parallel Top-Down Visual and Topic Attention

RoViST:Learning Robust Metrics for Visual Storytelling