Abstract: Similar to humans perceiving visual scenes as objects, Object-Centric Learning (OCL) can abstract dense images or videos into sparse object-level features. Transformer-based OCL handles complex textures well due to the decoding guidance of discrete representation, obtained by discretizing noisy features in image or video feature maps using template features from a codebook. However, treating features as minimal units overlooks their composing attributes, thus impeding model generalization; indexing features with natural numbers loses attribute-level commonalities and characteristics, thus diminishing heuristics for model convergence. We propose *Grouped Discrete Representation* (GDR) to address these issues by grouping features into attributes and indexing them with tuple numbers. In extensive experiments across different query initializations, dataset modalities, and model architectures, GDR consistently improves convergence and generalizability. Visualizations show that our method effectively captures attribute-level information in features. The source code will be available upon acceptance.
What problem does this paper attempt to address?
The paper targets two weaknesses of existing Transformer-based Object-Centric Learning (OCL) methods. First, when discretizing image or video features, they treat each feature as an indivisible unit and ignore the attributes it is composed of, which hinders model generalization. Second, they index features with natural numbers, which discards attribute-level commonalities and characteristics and so weakens the heuristic signal that guides model convergence.
Specifically, the paper points out:
1. **Limitations of existing methods**:
- Features are treated as the smallest units and their constituent attributes are ignored, which hinders the model's ability to generalize.
- Features are indexed with natural numbers, which cannot express what two features have in common or how they differ at the attribute level, weakening the heuristic signal for model convergence.
2. **Proposed method**:
- The paper proposes **Grouped Discrete Representation (GDR)**, which addresses both problems by grouping features into attributes and indexing them with tuples of numbers.
- In the reported experiments, GDR consistently improves convergence and generalization across different query initializations, dataset modalities, and model architectures.
3. **Specific improvements**:
- **Grouped discrete representation**: Features are decomposed into multiple attribute groups, each with its own sub-codebook. A tuple index selects one attribute from each group, and the selected attributes are combined into the final feature representation.
- **Enhanced learning signal**: With tuple indexes, a shared entry signals a shared attribute. For example, two features with the same first (or second) number have the same color (or shape) attribute. This gives the model a stronger learning signal and improves convergence.
- **Improved generalization ability**: Each attribute group has fewer possible values than a single codebook over whole features, so each code is reused more often, which improves generalization.
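The three points above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy sub-codebooks, the equal-width channel split, and the nearest-neighbor matching by squared Euclidean distance are all assumptions chosen for illustration.

```python
from math import prod


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`.

    Assumption: closeness is squared Euclidean distance.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def grouped_discretize(feature, sub_codebooks):
    """Quantize each channel group of `feature` with its own sub-codebook.

    Assumption: the feature is split into equal-width, contiguous groups.
    Returns (tuple index, quantized feature).
    """
    groups = len(sub_codebooks)
    if len(feature) % groups != 0:
        raise ValueError("feature length must be divisible by the group count")
    width = len(feature) // groups

    index_tuple, quantized = [], []
    for k, codebook in enumerate(sub_codebooks):
        chunk = feature[k * width:(k + 1) * width]
        idx = nearest_code(chunk, codebook)
        index_tuple.append(idx)
        quantized.extend(codebook[idx])  # combine selected attributes
    return tuple(index_tuple), quantized


def tuple_to_flat(index_tuple, sizes):
    """Collapse a tuple index into the single natural number it corresponds to."""
    flat = 0
    for idx, size in zip(index_tuple, sizes):
        flat = flat * size + idx
    return flat


# Hypothetical toy data: two attribute groups, three codes each.
sub_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],  # group 0, e.g. a color-like attribute
    [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]],  # group 1, e.g. a shape-like attribute
]
sizes = [len(cb) for cb in sub_codebooks]

feat_a = [0.9, 1.2, 0.1, 0.8]
feat_b = [1.1, 0.8, 4.7, 5.2]

idx_a, quant_a = grouped_discretize(feat_a, sub_codebooks)
idx_b, quant_b = grouped_discretize(feat_b, sub_codebooks)

print(idx_a, idx_b)                  # tuple indexes; the first entries match
print(tuple_to_flat(idx_a, sizes),   # flat indexes hide that shared attribute
      tuple_to_flat(idx_b, sizes))
print(sum(sizes), "codes cover", prod(sizes), "combinations")
```

In this toy setup the two features receive tuple indexes that agree in the first position, whereas their flat natural-number indexes are simply two different integers. The last line shows the reuse argument: six stored codes cover nine attribute combinations, so each code takes part in several combinations.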
In summary, the paper argues that existing OCL methods oversimplify their discrete feature representation, and it introduces Grouped Discrete Representation (GDR) to improve the model's convergence and generalization ability.