Abstract: Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multi-modal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks. More details can be found at https://openstorypp.github.io/

What problem does this paper attempt to address?

The paper addresses the difficulty current image generation models have in keeping multiple instances consistent across images when the context is long. Existing models can generate high-quality single images from brief captions, but when a lengthy context involves several instances, they often fail to keep those instances coherent from one image to the next. The paper attributes this mainly to the lack of fine-grained instance feature annotations in existing training datasets. To solve this problem, the researchers propose Openstory++, a large-scale dataset that adds instance-level annotations to both images and text. The dataset can be used to train multimodal generation models so that they attend to specific instances when generating visual stories. The paper also introduces a training method that emphasizes entity-centric image-text generation, so that the model learns to interweave visual and textual information effectively. To evaluate image generation when long multimodal context is provided, the researchers developed a new benchmark framework called Cohere-Bench. It focuses on whether the background, style, and instances in the given context stay coherent, which is central to assessing a model's performance on complex narratives. According to the paper, experiments conducted within Cohere-Bench show that Openstory++ and its accompanying training method improve the quality of visual storytelling models, particularly on open-domain generation tasks. The authors position this work as filling a critical gap in multimodal generation and advancing models that can generate and interpret complex narratives in open-domain environments.

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts

Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models

OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Story-Adapter: A Training-free Iterative Framework for Long Story Visualization

StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

Outline to Story: Fine-grained Controllable Story Generation from Cascaded Events

Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline

Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration

CoIn: A Lightweight and Effective Framework for Story Visualization and Continuation

GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort

Improved Visual Story Generation with Adaptive Context Modeling

SEED-Story: Multimodal Long Story Generation with Large Language Model

Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling

Generating Visual Stories with Grounded and Coreferent Characters

Knowledge-Enriched Visual Storytelling

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

Improving Visual Storytelling with Multimodal Large Language Models

ChinaOpen: A Dataset for Open-world Multimodal Learning

Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation