Abstract: Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multimodal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks. More details can be found at https://openstorypp.github.io/

Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration

Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline

A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

Story-driven Video Editing

StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

Towards Long Video Understanding via Fine-detailed Video Story Generation

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

Text2Video: an End-to-end Learning Framework for Expressing Text with Videos

StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification

Video Timeline Modeling For News Story Understanding

Visual Storylines: Semantic Visualization of Movie Sequence

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

CStory: A Chinese Large-scale News Storyline Dataset

Improving Visual Storytelling with Multimodal Large Language Models

StoryGAN: A Sequential Conditional GAN for Story Visualization

Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding

Video In-context Learning

TeViS:Translating Text Synopses to Video Storyboards