Abstract: Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when a long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multi-modal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks. More details can be found at https://openstorypp.github.io/.

CoIn: A Lightweight and Effective Framework for Story Visualization and Continuation

StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion

Story-Adapter: A Training-free Iterative Framework for Long Story Visualization

Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection

A Framework For Image Synthesis Using Supervised Contrastive Learning

Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models

ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context

AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort

StoryGAN: A Sequential Conditional GAN for Story Visualization

Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning

CoVis: A Collaborative Framework for Fine-grained Graphic Visual Understanding

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

CoC-GAN: Employing Context Cluster for Unveiling a New Pathway in Image Generation

InDecGAN: Learning to Generate Complex Images from Captions Via Independent Object-Level Decomposition and Enhancement

StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

TemporalStory: Enhancing Consistency in Story Visualization Using Spatial-Temporal Attention

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

Keep it Consistent: Topic-Aware Storytelling from an Image Stream via Iterative Multi-agent Communication

Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization

Storytelling from an Image Stream Using Scene Graphs

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation