Abstract: Recent image generation models excel at creating high-quality images from brief captions. However, they fail to keep multiple instances consistent across images when given lengthy contexts. This inconsistency is largely due to the absence of granular instance-level feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset that combines additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when a long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multimodal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks. More details can be found at https://openstorypp.github.io/

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling