Abstract: Video storytelling is engaging multimedia content that utilizes video and its accompanying narration to attract the audience, where a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video captioning and video story generation have made some progress. However, in practical applications, we typically require synchronized narrations for ongoing visual scenes. In this work, we introduce a new task of Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. These narrations, associated with each video clip, should relate to the visual content, integrate relevant knowledge, and have an appropriate word count corresponding to the clip's duration. Specifically, a structured storyline is beneficial to guide the generation process, ensuring coherence and integrity. To support the exploration of this task, we introduce a new benchmark dataset E-SyncVidStory with rich annotations. Since existing Multimodal LLMs are not effective in addressing this task in one-shot or few-shot settings, we propose a framework named VideoNarrator that can generate a storyline for input videos and simultaneously generate narrations with the guidance of the generated or predefined storyline. We further introduce a set of evaluation metrics to thoroughly assess the generation. Both automatic and human evaluations validate the effectiveness of our approach. Our dataset, codes, and evaluations will be released.

What problem does this paper attempt to address?

The paper addresses the generation of synchronized video narratives: given a video, produce narration that stays in step with the visual scenes and is informative and coherent. To that end it introduces a new task, Synchronized Video Storytelling. The narrative for each video segment should relate to that segment's visual content, integrate relevant external knowledge, and have a word count that matches the segment's duration. A structured storyline guides the generation process and keeps the narrative coherent and complete.

### Background and Challenges

1. **Limitations of Existing Research**:
   - Existing video-to-text generation research mainly focuses on generating single-sentence summaries or fine-grained video captions, but these methods cannot handle synchronized video storytelling.
   - Multimodal Large Language Models (Multimodal LLMs) perform poorly in zero-shot or few-shot settings, especially in temporal alignment and sequential video understanding.
   - Generating narratives for long videos is particularly challenging for models because of the large number of visual tokens they must handle.
2. **Needs in Practical Applications**:
   - In practical applications, especially in advertising and marketing, there is a need to generate narratives synchronized with videos to engage the audience and convey product information.
   - Structured storylines help maintain the coherence and completeness of the narrative, thereby enhancing audience engagement.

### Main Contributions of the Paper

1. **Introduction of a New Task**:
   - Proposed the task of synchronized video storytelling, which requires generating synchronized, informative, and coherent video narratives.
2. **Construction of a Benchmark Dataset**:
   - Created a benchmark dataset named E-SyncVidStory, containing 6,032 videos and 41,292 video segments, each with synchronized Chinese narratives and relevant knowledge annotations.
3. **Proposing an Effective Framework**:
   - Designed a framework named VideoNarrator, which combines visual models and large language models to generate storylines and synchronized narratives simultaneously.
   - Improved the effectiveness of visual embeddings by compressing long frame sequences into a compact visual representation that retains the important frames.
4. **Systematic Evaluation Metrics**:
   - Introduced a set of systematic evaluation metrics, including visual relevance, knowledge relevance, controllable accuracy, fluency, etc., to comprehensively evaluate the quality of the generated stories.

### Method Overview

1. **Visual Embedding**:
   - Obtained frame-level visual features through visual feature extractors and video projection layers, and performed visual compression and memory integration to reduce computational complexity and memory usage.
   - Used relative video segment position embeddings to capture the relative positional information of video segments.
2. **Prompt Design**:
   - Provided controllable signals for each video segment to ensure the generated narrative word count meets the requirements.
   - Avoided using irrelevant knowledge by providing concise task instructions and emphasizing the use of provided information.
3. **Training Process**:
   - Constructed multimodal instruction training samples to maximize the probability of generating each script label and narrative.
   - Updated the parameters of the video projection layer and video segment position embeddings to optimize model performance.

Through these methods, the paper addresses the challenges of generating synchronized video narratives and provides new directions and tools for future research.
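
The per-segment word-count control described under Prompt Design can be illustrated with a minimal sketch. This is not the paper's implementation: the speaking rate `WORDS_PER_SECOND`, the prompt wording, and the helper names `word_budget`, `build_prompt`, and `within_budget` are all assumptions made for illustration. Whitespace tokenization is also an assumption; for the Chinese narratives in E-SyncVidStory a character count would probably be the more natural unit.

```python
from dataclasses import dataclass

# Assumed speaking rate (words per second); the source does not state one.
WORDS_PER_SECOND = 4.0


@dataclass
class Clip:
    """A video segment given by its start and end time in seconds."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def word_budget(clip: Clip, rate: float = WORDS_PER_SECOND) -> int:
    """Target narration length for a clip: duration times rate, at least 1."""
    return max(1, round(clip.duration * rate))


def build_prompt(index: int, clip: Clip, knowledge: str,
                 rate: float = WORDS_PER_SECOND) -> str:
    """Hypothetical instruction carrying the length signal for one clip."""
    budget = word_budget(clip, rate)
    return (
        f"Clip {index} lasts {clip.duration:.1f} seconds. "
        f"Write a narration of about {budget} words. "
        f"Use only the provided information: {knowledge}"
    )


def within_budget(narration: str, budget: int, tolerance: int = 2) -> bool:
    """Check that a whitespace-tokenized narration is close to the budget."""
    return abs(len(narration.split()) - budget) <= tolerance


if __name__ == "__main__":
    clip = Clip(start=0.0, end=2.5)
    print(build_prompt(1, clip, "lightweight waterproof jacket"))
    print(within_budget("A light jacket that keeps out every drop of rain",
                        word_budget(clip)))
```

A check like `within_budget` is one simple way to approximate a length-control score after generation; the paper's own "controllable accuracy" metric may be defined differently.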

Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline
