Abstract: Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.

What problem does this paper attempt to address?

The paper mainly aims to address the following issues:

### Research Background and Objectives

- **Research Background**: Object-centric learning aims to represent visual data through a set of object entities (i.e., slots), which provide a structured representation that supports systematic generalization. Recent methods have made significant progress in unsupervised object discovery using Transformer architectures, and slot-based representations also have great potential in generative modeling, such as controllable image generation and object manipulation in image editing.
- **Existing Issues**: Current slot-based methods often produce blurry images and distorted objects, showing poor generative modeling capabilities.

### Main Research Objectives

- **Improve Slot-to-Image Decoding**: The paper focuses on improving the slot-to-image decoding process, which is a key part of high-quality visual generation.
- **Propose Method**: Introduce SlotDiffusion, an object-centric latent diffusion model (LDM) designed for both image and video data. By leveraging the modeling capacity of LDMs, it outperforms previous slot models in unsupervised object segmentation and visual generation across six datasets.
- **Expand Application Scope**: Demonstrate that the object features learned by SlotDiffusion can be used by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks, and that the method scales to unconstrained real-world datasets such as PASCAL VOC and COCO.

### Core Contributions

1. **SlotDiffusion Model**: An object-centric learning method based on diffusion models.
2. **Performance Improvement**: Achieved state-of-the-art results on image and video datasets when applied to unsupervised object discovery and visual generation.
3. **Dynamics Model Integration**: Showed that the learned slots can be directly used by state-of-the-art object-centric dynamics models, thereby improving future prediction and temporal reasoning performance.
4. **Application to Real-World Datasets**: Extended SlotDiffusion to real-world datasets by integrating with self-supervised pre-trained image encoders.

In short, this paper aims to improve the generative capabilities of slot-based models, particularly their performance on complex data, to enhance the quality of object discovery and visual generation.
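
The summary above does not spell out how the slot-to-image decoder works, so the following is only a rough sketch of what slot-conditioned denoising in a latent diffusion model could look like. It assumes noise-prediction training and cross-attention from latent tokens to slots, which are common LDM choices and are not confirmed by the text above. All names (`cross_attention`, `add_noise`, `slot_conditioned_loss`, `params`) and all sizes are hypothetical, and a single linear map stands in for the actual denoising network.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tokens, slots, w_q, w_k, w_v):
    """Each latent token (query) attends over the K slots (keys/values)."""
    q, k, v = tokens @ w_q, slots @ w_k, slots @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, K), rows sum to 1
    return attn @ v, attn

def add_noise(z0, noise, alpha_bar_t):
    """Standard forward diffusion: z_t = sqrt(a) * z0 + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * noise

def slot_conditioned_loss(z0, slots, params, alpha_bar_t, rng):
    """MSE between the true noise and a toy slot-conditioned prediction."""
    noise = rng.standard_normal(z0.shape)
    z_t = add_noise(z0, noise, alpha_bar_t)
    context, attn = cross_attention(
        z_t, slots, params["w_q"], params["w_k"], params["w_v"]
    )
    # Toy stand-in for a denoising network (assumption, not the paper's model).
    noise_pred = (z_t + context) @ params["w_out"]
    return float(np.mean((noise_pred - noise) ** 2)), attn

rng = np.random.default_rng(0)
N, K, D = 16, 4, 8  # latent tokens, slots, feature dimension (all hypothetical)
z0 = rng.standard_normal((N, D))     # stand-in for an encoded image latent
slots = rng.standard_normal((K, D))  # stand-in for slots from an object-centric encoder
params = {
    name: rng.standard_normal((D, D)) / np.sqrt(D)
    for name in ("w_q", "w_k", "w_v", "w_out")
}
loss, attn = slot_conditioned_loss(z0, slots, params, alpha_bar_t=0.5, rng=rng)
print(f"loss={loss:.4f}, attention shape={attn.shape}")
```

In this sketch the slots only enter through the attention context, so the loss can only be lowered by using slot information to explain the noised latent. The attention matrix also gives each latent token a soft assignment over slots, which is the kind of signal an object-centric model could read out as a segmentation.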

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

Object-Centric Slot Diffusion

Guided Latent Slot Diffusion for Object-Centric Learning

DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery

One Diffusion to Generate Them All

Dual Diffusion for Unified Image Generation and Understanding

InsertDiffusion: Identity Preserving Visualization of Objects through a Training-Free Diffusion Architecture

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

Open-vocabulary Object Segmentation with Diffusion Models

Boosting Latent Diffusion with Perceptual Objectives

Do text-free diffusion models learn discriminative visual representations?

ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter

Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts

Slot-VTON: Subject-Driven Diffusion-Based Virtual Try-on with Slot Attention

Diffusion Models For Multi-Modal Generative Modeling

Diffusion Models already have a Semantic Latent Space

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model