Abstract: Similar to how humans perceive visual scenes as objects, Object-Centric Learning (OCL) abstracts dense images or videos into sparse object-level features. Transformer-based OCL handles complex textures well because its decoding is guided by a discrete representation, obtained by discretizing the noisy features in image or video feature maps using template features from a codebook. However, treating features as minimal units overlooks the attributes that compose them, which impedes model generalization; indexing features with natural numbers loses attribute-level commonalities and characteristics, which weakens the heuristics that aid model convergence. We propose Grouped Discrete Representation (GDR) to address these issues by grouping features into attributes and indexing them with tuple numbers. In extensive experiments across different query initializations, dataset modalities, and model architectures, GDR consistently improves convergence and generalizability. Visualizations show that our method effectively captures attribute-level information in features. The source code will be available upon acceptance.

What problem does this paper attempt to address?

Existing Transformer-based Object-Centric Learning (OCL) methods treat each image or video feature as the smallest unit and ignore the attributes that compose it, which hinders model generalization. They also index features with natural numbers, which discards attribute-level commonalities and characteristics and weakens the heuristic signals that help the model converge. Specifically, the paper makes the following points:

1. **Limitations of existing methods**:
   - Treating features as the smallest units ignores their constituent attributes and hinders generalization.
   - Natural-number indices cannot express what two features have in common at the attribute level, which weakens the heuristic signals for convergence.
2. **Proposed method**:
   - **Grouped Discrete Representation (GDR)** groups features into attributes and indexes them with tuple numbers.
   - In experiments, GDR consistently improves convergence and generalization across different query initializations, dataset modalities, and model architectures.
3. **Specific improvements**:
   - **Grouped discrete representation**: Features are decomposed into multiple attribute groups, each with its own sub-codebook. A tuple index selects one attribute from each group, and the selected attributes are combined into the final feature representation.
   - **Enhanced learning signal**: Two features that share a number at the same tuple position share that attribute. For example, if the first and second numbers index color and shape, the same first number means the same color and the same second number means the same shape. This gives a stronger learning signal and improves convergence.
   - **Improved generalization ability**: Each attribute group has fewer possible values than a single flat codebook, so each code is reused more often, which improves generalization.
In summary, the paper addresses the oversimplified feature representation in existing OCL methods by introducing Grouped Discrete Representation (GDR), which in turn improves the model's convergence and generalization ability.
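The grouping and tuple indexing described above can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names `grouped_quantize` and `tuple_to_flat`, the equal-width channel split, the nearest-neighbor (squared Euclidean) matching, and the toy codebook values are all hypothetical choices made for the example.

```python
import numpy as np


def grouped_quantize(features, codebooks):
    """Quantize each feature group against its own sub-codebook.

    features:  (n, c) array of feature vectors.
    codebooks: list of g arrays, each of shape (a_k, c // g).
    Returns (indices, quantized): an (n, g) array of tuple indices and
    the (n, c) array rebuilt from the selected codes.
    Assumption: channels are split into g equal, contiguous groups.
    """
    n, c = features.shape
    g = len(codebooks)
    d = c // g
    assert c == g * d, "channel count must be divisible by the group count"

    indices = np.empty((n, g), dtype=np.int64)
    quantized = np.empty_like(features)
    for k, book in enumerate(codebooks):
        chunk = features[:, k * d:(k + 1) * d]                 # (n, d)
        # Squared Euclidean distance to every code in this sub-codebook.
        dist = ((chunk[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)                              # (n,)
        indices[:, k] = idx
        quantized[:, k * d:(k + 1) * d] = book[idx]
    return indices, quantized


def tuple_to_flat(indices, sizes):
    """Map tuple indices to the natural-number index of a flat codebook."""
    return np.ravel_multi_index(tuple(indices.T), sizes)


# Toy data: group 0 has 3 codes, group 1 has 2 codes (hypothetical values).
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]),
    np.array([[0.0, 0.0], [5.0, 5.0]]),
]
features = np.array([
    [0.9, 1.1, 4.8, 5.1],
    [0.1, -0.1, 0.2, 0.0],
    [0.9, 1.0, 0.1, 0.1],
])

indices, quantized = grouped_quantize(features, codebooks)
flat = tuple_to_flat(indices, sizes=(3, 2))
print(indices)
print(flat)
```

In this toy setup the first and third features receive the same first tuple number, so their shared attribute is visible directly in the index, whereas their flat natural-number indices carry no such relation. The two sub-codebooks store 3 + 2 = 5 codes but can express 3 × 2 = 6 combinations, which is the sense in which each code is reused more often.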

Grouped Discrete Representation Guides Object-Centric Learning

Organized Grouped Discrete Representation for Object-Centric Learning

Learning to Group Discrete Graphical Patterns

Learning Object-Centric Representation via Reverse Hierarchy Guidance

OVPT: Optimal Viewset Pooling Transformer for 3D Object Recognition

Object Pursuit: Building a Space of Objects Via Discriminative Weight Generation

DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding

Explicitly Disentangled Representations in Object-Centric Learning

Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning

Long-Range Grouping Transformer for Multi-View 3D Reconstruction

Learning Global Object-Centric Representations via Disentangled Slot Attention

Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

CrOC: Cross-View Online Clustering for Dense Visual Representation Learning

Perceptual Group Tokenizer: Building Perception with Iterative Grouping

Associating Objects with Scalable Transformers for Video Object Segmentation

CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction

Exploring Figure-Ground Assignment Mechanism in Perceptual Organization

Hierarchical Graph Interaction Transformer With Dynamic Token Clustering for Camouflaged Object Detection

Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning