Abstract: A video storyboard is a roadmap for video creation, consisting of shot-by-shot images that visualize the key plots in a text synopsis. Creating video storyboards remains challenging, however: it requires not only cross-modal association between high-level texts and images but also long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images as the video storyboard that visualizes the text synopsis. We construct the MovieNet-TeViS dataset based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes manually selected from the corresponding movies by considering both relevance and cinematic coherence. To benchmark the task, we present strong CLIP-based baselines and a novel VQ-Trans. VQ-Trans first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation. Then, it auto-regressively generates a sequence of visual features for retrieval and ordering. Experimental results demonstrate that VQ-Trans significantly outperforms prior methods and the CLIP-based baselines. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work. The code and data are available at: https://ruc-aimind.github.io/projects/TeViS/

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses the problem of converting text synopses into video storyboards. Specifically, the authors propose a new task named **Text Synopsis to Video Storyboard (TeViS)**, which aims to retrieve an ordered series of images from a large-scale movie database to visualize the input text synopsis. The task requires not only cross-modal association between high-level text and images but also long-term reasoning to ensure smooth transitions between images.

### Background and Challenges

1. **Limitations of Existing Work**:
   - **Text-to-Video Retrieval**: Mainly focuses on short video clips of a few seconds whose frames are highly redundant, so it cannot supply the coherent keyframes that a video storyboard requires.
   - **Story-to-Image Retrieval**: Aims to map detailed descriptive sentences to images one-to-one but does not require smooth transitions between images.
   - **Text-to-Video Generation**: Can generate short dynamic videos but cannot handle complex motions and dynamics.
2. **Needs of Professional Video Production**:
   - Creating high-quality video storyboards is very challenging for amateur video creators. A storyboard must not only include the relevant scenes, characters, and actions but also organize keyframes with professional cinematography to achieve smooth transitions.

### Solution

1. **Dataset Construction**:
   - **MovieNet-TeViS Dataset**: Built on the publicly available MovieNet dataset, it contains 10K text synopses, each paired with an average of 4.6 keyframes. The keyframes are manually selected by annotators, who consider both relevance and cinematic coherence.
2. **Model Design**:
   - **VQ-Trans Model**: First encodes the text synopsis and images into a joint embedding space, using Vector Quantization (VQ) to improve the visual representation. Then, it autoregressively generates a series of visual features for image retrieval and ordering.
3. **Evaluation Setup**:
   - **Ordering Task**: The model is given shuffled keyframes and must reorder them based on the text.
   - **Retrieval and Ordering Task**: The model retrieves relevant images from 500 candidate images and arranges them in order.

### Main Contributions

1. **Proposing the TeViS Task**: Aims to retrieve an ordered sequence of images to visualize high-level text synopses.
2. **Constructing the MovieNet-TeViS Benchmark Dataset**: Contains 10K text synopses, each paired with an average of 4.6 keyframes.
3. **Establishing a Baseline Model**: Proposes the VQ-Trans model, which combines vector quantization and autoregressive generation and significantly outperforms previous methods.

### Experimental Results

Experimental results show that the VQ-Trans model significantly outperforms other methods in both the ordering task and the retrieval and ordering task. A considerable gap to human performance remains, indicating potential for future research.

### Conclusion

The paper proposes the TeViS task and the VQ-Trans model, providing a new tool for amateur video creators to create high-quality video storyboards. Although the current model has room for improvement, this work lays the foundation for future research in cross-modal video generation.
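
As a rough illustration of the pipeline summarized above (vector quantization of visual features, then retrieval and ordering from a candidate pool), the following is a minimal sketch in plain Python. It is not the authors' VQ-Trans implementation: the function names `quantize` and `retrieve_storyboard`, the use of squared L2 distance for the codebook lookup, cosine similarity for retrieval, and the greedy "no candidate used twice" rule are all assumptions made for illustration. The predicted features stand in for whatever an autoregressive model would generate.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def quantize(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Generic VQ lookup: index of the codebook entry nearest to `feature`.

    Squared L2 distance is an assumption; the summary does not specify it.
    """
    def sq_dist(code: Vector) -> float:
        return sum((f - c) ** 2 for f, c in zip(feature, code))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def retrieve_storyboard(predicted: Sequence[Vector],
                        candidates: Sequence[Vector]) -> List[int]:
    """For each predicted feature, in order, pick the most similar unused candidate.

    The returned list of candidate indices is the ordered storyboard.
    Greedy matching without reuse is a hypothetical simplification.
    """
    if len(predicted) > len(candidates):
        raise ValueError("need at least as many candidates as predicted shots")
    used = set()
    storyboard = []
    for feat in predicted:
        best = max((i for i in range(len(candidates)) if i not in used),
                   key=lambda i: cosine(feat, candidates[i]))
        used.add(best)
        storyboard.append(best)
    return storyboard


# Toy example with 2-D features (real features would be high-dimensional,
# and the candidate pool in the benchmark has 500 images).
codebook = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
predicted = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

print(quantize([0.8, 0.3], codebook))              # 0
print(retrieve_storyboard(predicted, candidates))  # [1, 2, 0]
```

In this toy setup, the order of the predicted features determines the order of the retrieved images, which is how a single generated feature sequence can serve both retrieval and ordering.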

TeViS: Translating Text Synopses to Video Storyboards
