Abstract: DALL-E and Sora have gained attention by producing implausible images, such as "astronauts riding a horse in space." Despite the proliferation of text-to-vision models that have inundated the internet with synthetic visuals, from images to 3D assets, current benchmarks predominantly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions. Generate Any Scene leverages "scene graph programming", a method for dynamically constructing scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy includes numerous objects, attributes, and relations, enabling the synthesis of an almost infinite variety of scene graphs. Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models through standard metrics. We conduct extensive evaluations across multiple text-to-image, text-to-video, and text-to-3D models and present key findings on model performance. DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle to balance dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in human preference alignment. We demonstrate the effectiveness of Generate Any Scene through three practical applications that use its generated captions: 1) a self-improving framework where models iteratively enhance their performance using generated data, 2) a distillation process to transfer specific strengths from proprietary models to open-source counterparts, and 3) improvements in content moderation by identifying and generating challenging synthetic data.
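
The idea of building a scene graph from a taxonomy and turning it into a caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `TAXONOMY`, the `SceneGraph` class, and the functions `sample_scene_graph` and `to_caption` are hypothetical names, not the framework's actual API, and the real taxonomy and captioning rules are far richer.

```python
import random
from dataclasses import dataclass

# Hypothetical toy taxonomy (assumption); the real one is much larger.
TAXONOMY = {
    "objects": ["astronaut", "horse", "teapot", "lighthouse"],
    "attributes": ["red", "wooden", "tiny", "glowing"],
    "relations": ["riding", "next to", "on top of", "behind"],
}


@dataclass
class SceneGraph:
    objects: list      # object names; the list index is the node id
    attributes: dict   # node id -> list of attribute strings
    relations: list    # (subject id, relation, object id) triples


def sample_scene_graph(num_objects, num_attributes, num_relations, rng):
    """Sample a graph whose complexity is set by the three counts."""
    objects = [rng.choice(TAXONOMY["objects"]) for _ in range(num_objects)]
    attributes = {i: [] for i in range(num_objects)}
    for _ in range(num_attributes):
        node = rng.randrange(num_objects)
        attributes[node].append(rng.choice(TAXONOMY["attributes"]))
    relations = []
    if num_objects >= 2:  # a relation needs two distinct nodes
        for _ in range(num_relations):
            subj, obj = rng.sample(range(num_objects), 2)
            relations.append((subj, rng.choice(TAXONOMY["relations"]), obj))
    return SceneGraph(objects, attributes, relations)


def describe(graph, node):
    """Render one node as an article, its attributes, and its name."""
    phrase = " ".join(graph.attributes[node] + [graph.objects[node]])
    article = "an" if phrase[0] in "aeiou" else "a"
    return f"{article} {phrase}"


def to_caption(graph):
    """Translate a scene graph into a plain-text caption."""
    mentioned = set()
    clauses = []
    for subj, rel, obj in graph.relations:
        clauses.append(f"{describe(graph, subj)} {rel} {describe(graph, obj)}")
        mentioned.update((subj, obj))
    # Objects that take part in no relation are listed on their own.
    for node in range(len(graph.objects)):
        if node not in mentioned:
            clauses.append(describe(graph, node))
    return "; ".join(clauses) + "."


graph = SceneGraph(
    objects=["astronaut", "horse"],
    attributes={0: ["tiny"], 1: ["glowing"]},
    relations=[(0, "riding", 1)],
)
print(to_caption(graph))  # a tiny astronaut riding a glowing horse.
```

Calling `sample_scene_graph(3, 2, 1, random.Random(0))` would draw a random graph of the requested complexity. Passing an explicit `random.Random` instance keeps sampling reproducible, which matters when the captions are used as a fixed benchmark.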

ChatPainter: Improving Text to Image Generation using Dialogue

Machine-to-Machine Visual Dialoguing with ChatGPT for Enriched Textual Image Description

Teaching Text-to-Image Models to Communicate

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation

CHATEDIT: Towards Multi-turn Interactive Facial Image Editing via Dialogue

ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

DialogPaint: A Dialog-based Image Editing Model

Improving face generation quality and prompt following with synthetic captions

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

Improving Image Captioning with Better Use of Captions

DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset

Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images

DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation

Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models

IMAD: IMage-Augmented multi-modal Dialogue

Improving Multimodal Datasets with Image Captioning

Synthesis of Vision and Language: Multifaceted Image Captioning Application

Simple Dialogue System with AUDITED