Abstract: DALL-E and Sora have gained attention by producing implausible images, such as "astronauts riding a horse in space." Despite the proliferation of text-to-vision models that have inundated the internet with synthetic visuals, from images to 3D assets, current benchmarks predominantly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions. Generate Any Scene leverages 'scene graph programming', a method for dynamically constructing scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy includes numerous objects, attributes, and relations, enabling the synthesis of an almost infinite variety of scene graphs. Generate Any Scene then translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models with standard metrics. We conduct extensive evaluations across multiple text-to-image, text-to-video, and text-to-3D models and present key findings on model performance. We find that DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle to balance dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in human preference alignment. We demonstrate the effectiveness of Generate Any Scene through three practical applications that use its generated captions: 1) a self-improving framework in which models iteratively enhance their performance using generated data, 2) a distillation process that transfers specific strengths from proprietary models to open-source counterparts, and 3) improved content moderation by identifying and generating challenging synthetic data.
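
The scene graph programming idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the four-item taxonomy, the `SceneGraph` class, and the `sample_scene_graph` and `to_caption` functions are hypothetical names chosen for illustration, and the caption template is a simple assumption. The sketch only shows the general pattern of sampling objects, attributes, and relations at a chosen complexity and rendering the resulting graph as text.

```python
import random
from dataclasses import dataclass, field

# Toy taxonomy (hypothetical; the abstract describes a far larger one).
OBJECTS = ["astronaut", "horse", "robot", "teapot"]
ATTRIBUTES = ["red", "wooden", "tiny", "glowing"]
RELATIONS = ["riding", "next to", "holding", "behind"]


@dataclass
class SceneGraph:
    """Hypothetical container: objects are nodes, relations are directed edges."""
    objects: list
    attributes: dict = field(default_factory=dict)   # object index -> attribute list
    relations: list = field(default_factory=list)    # (subject, relation, object)


def sample_scene_graph(n_objects, n_attributes, n_relations, rng):
    """Sample a scene graph whose complexity is set by the three counts."""
    if n_objects < 1:
        raise ValueError("a scene needs at least one object")
    objects = [rng.choice(OBJECTS) for _ in range(n_objects)]
    attributes = {i: [] for i in range(n_objects)}
    for _ in range(n_attributes):
        i = rng.randrange(n_objects)
        # Avoid giving the same attribute to one object twice.
        candidates = [a for a in ATTRIBUTES if a not in attributes[i]]
        if candidates:
            attributes[i].append(rng.choice(candidates))
    relations = []
    if n_objects >= 2:  # a relation needs two distinct objects
        for _ in range(n_relations):
            subj, obj = rng.sample(range(n_objects), 2)
            relations.append((subj, rng.choice(RELATIONS), obj))
    return SceneGraph(objects, attributes, relations)


def noun_phrase(graph, i):
    """Render one object with its attributes, e.g. 'a glowing horse'."""
    words = graph.attributes.get(i, []) + [graph.objects[i]]
    article = "an" if words[0][0] in "aeiou" else "a"
    return article + " " + " ".join(words)


def to_caption(graph):
    """Translate the graph into a caption: one clause per relation,
    followed by any objects that take part in no relation."""
    clauses, mentioned = [], set()
    for subj, rel, obj in graph.relations:
        clauses.append(f"{noun_phrase(graph, subj)} {rel} {noun_phrase(graph, obj)}")
        mentioned.update((subj, obj))
    for i in range(len(graph.objects)):
        if i not in mentioned:
            clauses.append(noun_phrase(graph, i))
    return "; ".join(clauses)


if __name__ == "__main__":
    rng = random.Random(0)
    graph = sample_scene_graph(n_objects=3, n_attributes=2, n_relations=2, rng=rng)
    print(to_caption(graph))
```

Raising the three counts yields more complex scenes from the same taxonomy, and each sampled caption could then be passed to a text-to-vision model and scored with whatever metric is being studied.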

ModelScope Text-to-Video Technical Report

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Towards A Better Metric for Text-to-Video Generation

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation

Technical Report: Competition Solution For Modelscope-Sora

Text-Animator: Controllable Visual Text Video Generation