BlobGEN-3D: Compositional 3D-Consistent Freeview Image Generation with 3D Blobs

Chao Liu, Weili Nie, Sifei Liu, Abhishek Badki, Hang Su, Morteza Mardani, Benjamin Eckart, Arash Vahdat
DOI: https://doi.org/10.1145/3680528.3687645
2024-01-01
Abstract: Recent advances in text-to-image diffusion models trained on internet-scale data have significantly improved image generation quality. However, existing methods rely on image-level or scene-level conditions, which limits their ability to synthesize composable 3D objects in a complex scene. To address this limitation, we propose BlobGEN-3D, a novel approach that decouples the compositional 3D scene representation from 2D image generation, enabling direct control in 3D space while fully leveraging the capabilities of 2D diffusion models. Specifically, BlobGEN-3D uses object-level 3D blobs with rich textual descriptions as the 3D scene representation, which is amenable to 2D projection and integrates seamlessly with 2D diffusion models. Based on this representation, we introduce an auto-regressive pipeline for freeview image generation that conditions a pretrained blob-grounded 2D text-to-image diffusion model on the previously generated image. Our method has three key features: (i) modular representation of 3D scene elements; (ii) coherent cross-view 2D generation; and (iii) manipulation of object appearance in the generated image sequences. Our method is not only competitive with existing multi-view and optimization-based approaches, but also offers object-level appearance control, which was not previously possible with alternatives that rely solely on scene-level descriptions or image captions.
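The abstract describes two mechanical steps: projecting object-level 3D blobs into each 2D view, and generating views auto-regressively by conditioning on the previous image. The following is a minimal sketch of that control flow, not the authors' implementation. It assumes each blob can be modeled as a 3D Gaussian (mean and covariance) and uses the standard first-order perspective projection of a Gaussian; the abstract does not specify the blob parameterization. The names `Blob3D`, `project_blob`, `generate_sequence`, and the `generate_view` callable (a stand-in for the blob-grounded 2D diffusion model) are all hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Blob3D:
    """Hypothetical object-level blob: a 3D Gaussian plus a text description."""
    mean: np.ndarray  # (3,) center in world coordinates
    cov: np.ndarray   # (3, 3) covariance encoding extent and orientation
    text: str         # per-object description


def project_blob(blob, R, t, fx, fy, cx, cy):
    """Project a 3D Gaussian blob to a 2D ellipse (mean, covariance).

    Assumes a pinhole camera with rotation R, translation t, and
    intrinsics (fx, fy, cx, cy). Returns None if the blob center is
    behind the camera.
    """
    x, y, z = R @ blob.mean + t  # blob center in the camera frame
    if z <= 0:
        return None
    mean2d = np.array([fx * x / z + cx, fy * y / z + cy])
    # Jacobian of the perspective projection, evaluated at the blob center.
    J = np.array([
        [fx / z, 0.0, -fx * x / z**2],
        [0.0, fy / z, -fy * y / z**2],
    ])
    cov2d = J @ R @ blob.cov @ R.T @ J.T
    return mean2d, cov2d


def generate_sequence(blobs, cameras, generate_view):
    """Auto-regressive loop: each view is conditioned on the previous image.

    `cameras` is a list of (R, t, (fx, fy, cx, cy)) tuples.
    `generate_view(layout, prev_image)` is a placeholder for the generator;
    `prev_image` is None for the first view.
    """
    images = []
    prev = None
    for R, t, intrinsics in cameras:
        projected = [(project_blob(b, R, t, *intrinsics), b.text) for b in blobs]
        # Keep only blobs that are in front of this camera.
        layout = [(p, text) for p, text in projected if p is not None]
        prev = generate_view(layout, prev)
        images.append(prev)
    return images
```

Because every blob carries its own text, editing one object's description changes only that entry of the 2D layout passed to the generator, which is one way to read the object-level appearance control the abstract refers to.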