Abstract: Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and little work has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. Our code, model, and dataset will be available at https://creatilayout.github.io.
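
The siamese decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `siamese_fuse`), the `layout_weight` parameter, and the weighted-sum fusion rule are assumptions made for illustration, and the learned per-modality projections of a real MM-DiT are omitted. The sketch only shows the idea of computing image-text and image-layout attention in two parallel branches and fusing their outputs afterwards.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def siamese_fuse(image, text, layout, layout_weight=1.0):
    """Toy late fusion of two parallel attention branches (assumed rule)."""
    # Branch 1: image tokens attend over image + text tokens.
    image_text = attend(image, image + text, image + text)
    # Branch 2 (siamese): image tokens attend over image + layout tokens,
    # so layout never competes with text inside one attention map.
    image_layout = attend(image, image + layout, image + layout)
    # Late fusion: weighted sum of the two branch outputs (an assumption).
    return [
        [a + layout_weight * b for a, b in zip(row_t, row_l)]
        for row_t, row_l in zip(image_text, image_layout)
    ]


# Hypothetical 2-dimensional tokens, chosen only to exercise the code.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 1.0]]
layout_tokens = [[0.5, -0.5]]
fused = siamese_fuse(image_tokens, text_tokens, layout_tokens)
```

Setting `layout_weight=0.0` reduces the output to the image-text branch alone, which makes the contribution of the layout branch easy to inspect in isolation.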

Amodal Layout Completion in Complex Outdoor Scenes

Image Amodal Completion: A Survey

Open-World Amodal Appearance Completion

Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion

Amodal Ground Truth and Completion in the Wild

Amodal segmentation just like doing a jigsaw

Image Synthesis from Layout with Locality-Aware Mask Adaption

MonoLayout: Amodal scene layout from a single image

Layout Generation for Various Scenarios in Mobile Shopping Applications

Neural Rendering in a Room: Amodal 3D Understanding and Free-Viewpoint Rendering for the Closed Scene Composed of Pre-Captured Objects

PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus

360 Layout Estimation via Orthogonal Planes Disentanglement and Multi-view Geometric Consistency Perception

LayoutPrompter: Awaken the Design Ability of Large Language Models

AutoLay: Benchmarking amodal layout estimation for autonomous driving

Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM

LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Multimodal Shape Completion Via Conditional Generative Adversarial Networks

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models