Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline

Dingyi Yang, Chunru Zhan, Ziheng Wang, Biao Wang, Tiezheng Ge, Bo Zheng, Qin Jin
2024-05-23
Abstract: Video storytelling is engaging multimedia content that utilizes video and its accompanying narration to attract the audience, where a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video captioning and video story generation have made some progress. However, in practical applications, we typically require synchronized narrations for ongoing visual scenes. In this work, we introduce a new task of Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. These narrations, associated with each video clip, should relate to the visual content, integrate relevant knowledge, and have an appropriate word count corresponding to the clip's duration. Specifically, a structured storyline is beneficial to guide the generation process, ensuring coherence and integrity. To support the exploration of this task, we introduce a new benchmark dataset E-SyncVidStory with rich annotations. Since existing Multimodal LLMs are not effective in addressing this task in one-shot or few-shot settings, we propose a framework named VideoNarrator that can generate a storyline for input videos and simultaneously generate narrations with the guidance of the generated or predefined storyline. We further introduce a set of evaluation metrics to thoroughly assess the generation. Both automatic and human evaluations validate the effectiveness of our approach. Our dataset, codes, and evaluations will be released.
Multimedia
What problem does this paper attempt to address?
This paper addresses the generation of synchronized video narrations: given a video, produce narrations that are aligned with the ongoing visual scenes, informative, and coherent. To this end it introduces a new task, Synchronized Video Storytelling. The narration for each video clip should relate to the clip's visual content, integrate relevant external knowledge, and have a word count that matches the clip's duration. A structured storyline guides the generation process so that the overall narration stays coherent and complete.

### Background and Challenges

1. **Limitations of Existing Research**:
   - Existing video-to-text research mainly produces single-sentence summaries or fine-grained video captions, neither of which yields narrations synchronized with the video.
   - Multimodal Large Language Models (Multimodal LLMs) perform poorly on this task in zero-shot or few-shot settings, especially in temporal alignment and sequential video understanding.
   - Generating narrations for long videos is particularly challenging because the model must handle a large number of visual tokens.
2. **Needs in Practical Applications**:
   - In practical applications, especially advertising and marketing, narrations must be synchronized with the video to engage the audience and convey product information.
   - A structured storyline helps maintain the coherence and completeness of the narration, which in turn improves audience engagement.

### Main Contributions of the Paper

1. **Introduction of a New Task**:
   - Proposes the task of synchronized video storytelling, which requires generating synchronized, informative, and coherent video narrations.
2. **Construction of a Benchmark Dataset**:
   - Creates a benchmark dataset named E-SyncVidStory, containing 6,032 videos and 41,292 video clips, each with synchronized Chinese narrations and relevant knowledge annotations.
3. **Proposing an Effective Framework**:
   - Designs a framework named VideoNarrator, which combines a visual model with a large language model to generate the storyline and the synchronized narrations simultaneously.
   - Makes visual embeddings more effective by compressing long frame sequences into a shorter visual representation while retaining important frames.
4. **Systematic Evaluation Metrics**:
   - Introduces a set of evaluation metrics, including visual relevance, knowledge relevance, controllable accuracy, and fluency, to assess the quality of the generated stories.

### Method Overview

1. **Visual Embedding**:
   - Frame-level visual features are obtained through a visual feature extractor and a video projection layer, followed by visual compression and memory integration to reduce computational cost and memory usage.
   - Relative video clip position embeddings capture where each clip sits within the video.
2. **Prompt Design**:
   - Each video clip is given a control signal so that the word count of the generated narration meets the requirement.
   - Concise task instructions that emphasize using only the provided information keep the model from introducing irrelevant knowledge.
3. **Training Process**:
   - Multimodal instruction-tuning samples are constructed, and training maximizes the probability of generating each script label and narration.
   - The parameters of the video projection layer and the video clip position embeddings are updated during training.

With these components, the paper addresses the main challenges of generating synchronized video narrations and offers a dataset, a framework, and evaluation metrics for future research.
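
The per-clip length control described under Prompt Design can be illustrated with a minimal sketch. It is not the paper's implementation: the speaking rate, the tolerance, the prompt wording, and the names `Clip`, `word_budget`, `control_prompt`, and `within_budget` are all assumptions made for illustration. For Chinese narrations such as those in E-SyncVidStory, one would likely count characters rather than whitespace-separated words, which the pluggable `count_fn` allows.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Clip:
    """A video clip given by its start and end time in seconds."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def word_budget(clip: Clip, words_per_second: float = 4.0,
                tolerance: float = 0.2) -> Tuple[int, int]:
    """Return an inclusive (low, high) narration length for a clip.

    words_per_second and tolerance are assumed values, not from the paper.
    """
    if clip.duration <= 0:
        raise ValueError("clip must have a positive duration")
    target = round(clip.duration * words_per_second)
    slack = max(1, round(target * tolerance))
    return max(1, target - slack), target + slack


def control_prompt(clip: Clip, budget: Tuple[int, int]) -> str:
    """Build a hypothetical control signal to place in the clip's prompt."""
    low, high = budget
    return (f"Clip {clip.start:.1f}s-{clip.end:.1f}s: "
            f"write a narration of {low} to {high} words.")


def within_budget(narration: str, budget: Tuple[int, int],
                  count_fn: Callable[[str], int] = lambda s: len(s.split())) -> bool:
    """Check whether a generated narration respects the length budget."""
    low, high = budget
    return low <= count_fn(narration) <= high
```

Under these assumed defaults, a 5-second clip gets a budget of 16 to 24 words, and a narration can be checked after generation with `within_budget(text, word_budget(clip))`. Passing `count_fn=len` switches the check to character counts.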