Abstract:Causal representation learning has showed a variety of settings in which we can disentangle latent variables with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are represented as $d$-dimensional vectors, and (2) that the observations are the output of some injective generative function of these latent variables. While these assumptions appear benign, we show that when the observations are of multiple objects, the generative function is no longer injective and disentanglement fails in practice. We can address this failure by combining recent developments in object-centric learning and causal representation learning. By modifying the Slot Attention architecture <a class="link-https" data-arxiv-id="2006.15055" href="https://arxiv.org/abs/2006.15055">arXiv:2006.15055</a>, we develop an object-centric architecture that leverages weak supervision from sparse perturbations to disentangle each object's properties. This approach is more data-efficient in the sense that it requires significantly fewer perturbations than a comparable approach that encodes to a Euclidean space and we show that this approach successfully disentangles the properties of a set of objects in a series of simple image-based disentanglement experiments.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve? This paper aims to solve two key problems in causal representation learning, which are especially prominent when dealing with images containing multiple objects: 1. **Non - identifiability Problem**: - In traditional causal representation learning, it is assumed that the latent variable $z$ is mapped to the observed data $x$ through an injective generating function $g(z)$. However, when there are multiple objects in an image, this invertibility assumption no longer holds. Specifically, if an image consists of a set of objects, the order of these objects will not affect the final image output, that is, for any permutation matrix $\Pi$, $g(z)=g(\Pi z)$. This leads to the non - uniqueness problem, making it difficult to recover the latent variable $z$ from the observed data $x$. - The paper points out that when the identities of objects are indistinguishable (for example, all balls have the same color), the performance of existing disentanglement methods (such as Ahuja et al., 2022b) will be limited, and its performance upper limit is $\frac{1}{k}$, where $k$ is the number of objects. 2. **Object Identity Problem**: - When performing sparse perturbations on multiple objects, it is a challenge to determine which object has been perturbed. In the case of a single object, the identity of the object can be tracked through a consistent ordering, but in a multi - object scenario, the order of objects can be freely exchanged, and the order before and after the perturbation cannot be guaranteed to be consistent. This will cause the encoder to produce discontinuous changes in the latent space, thus making existing disentanglement methods ineffective. ### Solutions To address the above problems, the paper proposes a method based on object - centric architectures, combining recent progress in object - centric learning and causal representation learning. Specifically: - **Object - centric Architectures**: By modifying the Slot Attention architecture (Locatello et al., 2020b), a new object - centric architecture is developed, which uses weakly - supervised signals from sparse perturbations to disentangle the properties of each object. This method not only solves the non - uniqueness problem but also significantly reduces the number of perturbations required, thereby improving data efficiency. - **Weak Supervision and Matching**: A matching step is introduced to ensure that the perturbed object can be correctly identified after the perturbation is applied. This is achieved by minimizing a cost matrix that measures the differences between all slot pairs. ### Experimental Results The paper verifies the effectiveness of this method through a series of simple image - based disentanglement experiments, showing strong empirical results in 2D and 3D synthetic benchmark tests. Experiments show that the proposed method can significantly reduce the number of perturbations required while maintaining high disentanglement performance. In summary, this paper successfully solves the non - uniqueness and object identity problems in multi - object scenarios by introducing object - centric architectures, providing a new solution for causal representation learning.

Object-centric architectures enable efficient causal representation learning

Explicitly Disentangled Representations in Object-Centric Learning

Causal disentanglement of multimodal data

Learning Causally Disentangled Representations via the Principle of Independent Causal Mechanisms

Learning Global Object-Centric Representations via Disentangled Slot Attention

Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior

Disentanglement of Latent Representations via Causal Interventions

Disentangled Representations for Causal Cognition

Identifiable Object-Centric Representation Learning via Probabilistic Slot Attention

Learning Causal Representation for Training Cross-Domain Pose Estimator Via Generative Interventions

Desiderata for Representation Learning: A Causal Perspective

Multi-View Causal Representation Learning with Partial Observability

Unifying Causal Representation Learning with the Invariance Principle

Do learned representations respect causal relationships?

Efficiently Disentangle Causal Representations

A Sparsity Principle for Partially Observable Causal Representation Learning

Language-Based Causal Representation Learning

Specify Robust Causal Representation from Mixed Observations.

Unpaired Multi-Domain Causal Representation Learning

Causal Representation Learning from Multiple Distributions: A General Setting

Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases