SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, Animesh Garg
2023-09-22
Abstract: Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper mainly aims to address the following issues:

### Research Background and Objectives

- **Research Background**: Object-centric learning represents visual data as a set of object entities (i.e., slots), giving a structured representation that supports systematic generalization. Recent methods built on Transformer architectures have made significant progress in unsupervised object discovery, and slot-based representations also hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing.
- **Existing Issues**: Current slot-based methods often produce blurry images and distorted objects, showing poor generative modeling capabilities.

### Main Research Objectives

- **Improve Slot-to-Image Decoding**: The paper focuses on improving the slot-to-image decoding process, which is a key part of high-quality visual generation.
- **Proposed Method**: Introduce SlotDiffusion, an object-centric Latent Diffusion Model (LDM) designed for both image and video data. It leverages the modeling capacity of LDMs and surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets.
- **Expand Application Scope**: Demonstrate that the learned object features can be used by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks, and that the method scales to unconstrained real-world datasets such as PASCAL VOC and COCO.

### Core Contributions

1. **SlotDiffusion Model**: An object-centric learning method based on diffusion models.
2. **Performance Improvement**: Achieved state-of-the-art results on image and video datasets in unsupervised object discovery and visual generation.
3. **Dynamics Model Integration**: Showed that the learned slots can be used directly by state-of-the-art object-centric dynamics models, improving future prediction and temporal reasoning performance.
4. **Application to Real-World Datasets**: Extended SlotDiffusion to real-world datasets by integrating it with self-supervised pre-trained image encoders.

In short, this paper aims to improve the generative capabilities of slot-based models, particularly on complex data, and thereby the quality of both object discovery and visual generation.
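To make the idea of slot-to-image decoding with a diffusion model more concrete, the following is a minimal NumPy sketch of one training step of a generic noise-prediction diffusion objective conditioned on slots. It is an illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the weight-free cross-attention, and the names `make_alpha_bar`, `add_noise`, `cross_attend`, `toy_denoiser`, and `denoising_loss` are all hypothetical, and the real model would use a learned denoising network operating on latents from a pre-trained image autoencoder.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def cross_attend(latent_tokens, slots):
    """Each latent token reads from the slots via softmax attention.

    No learned projections are used here; this is only a stand-in for the
    conditioning mechanism of a real denoiser.
    """
    d = slots.shape[-1]
    logits = latent_tokens @ slots.T / np.sqrt(d)      # (N, K)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ slots, weights                    # (N, D), (N, K)

def toy_denoiser(zt, slots):
    """Hypothetical slot-conditioned denoiser: noisy latent minus slot context."""
    context, _ = cross_attend(zt, slots)
    return zt - context

def denoising_loss(z0, slots, t, alpha_bar, rng):
    """Mean squared error between the true and the predicted noise."""
    zt, eps = add_noise(z0, t, alpha_bar, rng)
    eps_pred = toy_denoiser(zt, slots)
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((16, 8))     # 16 latent tokens of dimension 8
slots = rng.standard_normal((4, 8))   # 4 slots of dimension 8
loss = denoising_loss(z0, slots, t=500, alpha_bar=alpha_bar, rng=rng)
```

In this sketch the slots enter only through the attention context, so the decoder is trained to denoise rather than to reconstruct pixels directly, which is the kind of decoder change the summary above credits for sharper generations.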