Abstract: DALL-E and Sora have gained attention by producing implausible images, such as "astronauts riding a horse in space." Despite the proliferation of text-to-vision models that have inundated the internet with synthetic visuals, from images to 3D assets, current benchmarks predominantly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions. Generate Any Scene leverages 'scene graph programming', a method for dynamically constructing scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy includes numerous objects, attributes, and relations, enabling the synthesis of an almost infinite variety of scene graphs. Generate Any Scene then translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models with standard metrics. We conduct extensive evaluations across multiple text-to-image, text-to-video, and text-to-3D models and report key findings on model performance. We find that DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle to balance dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in human preference alignment. We demonstrate the effectiveness of Generate Any Scene through three practical applications that use its generated captions: 1) a self-improving framework in which models iteratively enhance their performance using generated data, 2) a distillation process that transfers specific strengths from proprietary models to open-source counterparts, and 3) improved content moderation by identifying and generating challenging synthetic data.
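The idea of building a scene graph from a taxonomy and turning it into a caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tiny `TAXONOMY`, the `SceneGraph` class, and the functions `sample_scene_graph` and `to_caption` are hypothetical names chosen for illustration, and the caption template is an assumption.

```python
import random
from dataclasses import dataclass

# Hypothetical toy taxonomy; the abstract only says the real one holds
# "numerous objects, attributes, and relations".
TAXONOMY = {
    "objects": ["astronaut", "horse", "teapot", "robot"],
    "attributes": ["red", "wooden", "tiny", "glowing"],
    "relations": ["riding", "next to", "holding", "behind"],
}


@dataclass
class SceneGraph:
    objects: list      # object names, indexed by node id
    attributes: dict   # node id -> list of attribute strings
    relations: list    # (subject id, relation, object id) triples


def sample_scene_graph(num_objects, num_attributes, num_relations, rng):
    """Sample a graph whose complexity is set by the three counts."""
    objects = [rng.choice(TAXONOMY["objects"]) for _ in range(num_objects)]
    attributes = {node: [] for node in range(num_objects)}
    for _ in range(num_attributes):
        node = rng.randrange(num_objects)
        attr = rng.choice(TAXONOMY["attributes"])
        if attr not in attributes[node]:  # skip duplicate attributes on a node
            attributes[node].append(attr)
    relations = []
    if num_objects >= 2:  # a relation needs two distinct nodes
        for _ in range(num_relations):
            subj, obj = rng.sample(range(num_objects), 2)
            relations.append((subj, rng.choice(TAXONOMY["relations"]), obj))
    return SceneGraph(objects, attributes, relations)


def describe(graph, node):
    """Render one node as a noun phrase, e.g. 'a red horse'."""
    phrase = " ".join(graph.attributes[node] + [graph.objects[node]])
    article = "an" if phrase[0] in "aeiou" else "a"
    return f"{article} {phrase}"


def to_caption(graph):
    """Translate the graph into a caption with a simple fixed template."""
    clauses = [
        f"{describe(graph, s)} {rel} {describe(graph, o)}"
        for s, rel, o in graph.relations
    ]
    mentioned = {n for s, _, o in graph.relations for n in (s, o)}
    # Objects that take part in no relation are listed on their own.
    clauses += [
        describe(graph, n) for n in range(len(graph.objects)) if n not in mentioned
    ]
    return "; ".join(clauses)


if __name__ == "__main__":
    graph = sample_scene_graph(3, 2, 2, random.Random(0))
    print(to_caption(graph))
```

Raising the three counts yields more complex graphs and longer captions, which is one plausible way to vary scene complexity in the manner the abstract describes.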

AnyScene: Customized Image Synthesis with Composited Foreground

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Scene Text Synthesis for Efficient and Effective Deep Network Training

ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion

ComFusion: Personalized Subject Generation in Multiple Specific Scenes From Single Image

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning

Scene Diffusion: Text-driven Scene Image Synthesis Conditioning on a Single 3D Model

Composition-Aware Scene Optimization for Product Images

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting

What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Layout Agnostic Scene Text Image Synthesis with Diffusion Models

Text-to-image synthesis: Starting composite from the foreground content

Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion