Abstract: Humans excel at abstracting data and constructing *reusable* concepts, a capability lacking in current continual learning systems. The field of object-centric learning addresses this by developing abstract representations, or slots, from data without human supervision. Different methods have been proposed to tackle this task for images, but most are overly complex, non-differentiable, or poorly scalable. In this paper, we introduce a conceptually simple, fully differentiable, non-iterative, and scalable method called SAMP (Simplified Slot Attention with Max Pool Priors). It is implementable using only Convolution and MaxPool layers and an Attention layer. Our method encodes the input image with a Convolutional Neural Network and then uses a branch of alternating Convolution and MaxPool layers to create specialized sub-networks and extract primitive slots. These primitive slots are then used as queries for a Simplified Slot Attention over the encoded image. Despite its simplicity, our method is competitive with or outperforms previous methods on standard benchmarks.


### What problem does this paper attempt to solve?

This paper addresses the inability of current continual learning systems to abstract data and construct reusable concepts. Specifically, it focuses on object-centric learning (OCL): extracting abstract object representations (called "slots") from raw inputs such as images or videos. These slots can be used for reasoning or decision-making, which improves the model's ability to generalize and adapt. However, most existing OCL methods have the following problems:

1. **High complexity**: Many methods are complicated to implement.
2. **Non-differentiable**: Some methods are not fully differentiable, which can make training unstable.
3. **Poor scalability**: Many methods scale poorly to large datasets.

To solve these problems, the authors propose a new method, Simplified Slot Attention with Max Pool Priors (SAMP). SAMP has the following characteristics:

- **Simple concept**: Easy to understand and implement.
- **Fully differentiable**: Supports end-to-end training.
- **Non-iterative**: Does not require repeated iterative refinement, which reduces computational overhead.
- **Highly scalable**: Able to handle large-scale datasets.

With SAMP, the authors aim to match or improve on the performance of existing OCL methods on standard benchmarks while keeping the model simple.

#### Method overview

The core idea of SAMP is to extract features from the raw image with a convolutional neural network (CNN) and max-pooling (MaxPool) layers, group these features with a Simplified Slot Attention layer (SSA layer), and finally produce object representations. The specific steps are as follows:

1. **Encoder**: CNN layers encode the input image, preserving the spatial dimensions and extracting pixel-level features.
2. **Grouping module**: A branch of alternating convolutional and max-pooling layers forms specialized sub-networks that extract primitive slots. These primitive slots are then used as queries for the SSA layer.
3. **Decoder**: Each slot reconstructs part of the image through a spatial broadcast decoder, and the complete reconstruction is obtained as a weighted sum of the per-slot outputs.

In this way, SAMP simplifies the model structure while introducing competition through both the max-pooling layers and the SSA layer, so that different slots explain different parts of the input. This improves the model's expressiveness and generalization.

#### Experimental results

The authors ran experiments on three standard benchmark datasets (CLEVR6, Multi-dSprites, and Tetrominoes). The results show that SAMP is competitive across these benchmarks and on some tasks even exceeds the best existing methods (such as Slot Attention). This indicates that SAMP can be effective for object-centric learning while remaining simple.
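The grouping step from the method overview (primitive slots obtained by max pooling, then used as queries in one non-iterative attention pass) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It makes several simplifying assumptions: the learned convolutions between pooling steps are omitted, there are no learned query/key/value projections, and the function names `max_pool_2x2`, `primitive_slots`, and `single_pass_slot_attention` are hypothetical. The softmax over the slot axis follows the competition idea of standard Slot Attention; the exact form of the paper's SSA layer may differ.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def max_pool_2x2(feat):
    """2x2 max pooling with stride 2 on an (H, W, D) map; H and W must be even."""
    h, w, d = feat.shape
    return feat.reshape(h // 2, 2, w // 2, 2, d).max(axis=(1, 3))


def primitive_slots(feat, num_slots):
    """Pool the feature map until it has num_slots cells, then flatten to (K, D).

    Assumption: the learned convolutions that the paper interleaves with
    pooling are left out here, so this only shows the pooling prior.
    """
    while feat.shape[0] * feat.shape[1] > num_slots:
        feat = max_pool_2x2(feat)
    if feat.shape[0] * feat.shape[1] != num_slots:
        raise ValueError("feature map cannot be pooled to exactly num_slots cells")
    return feat.reshape(num_slots, -1)


def single_pass_slot_attention(slots, feats, eps=1e-8):
    """One attention pass with slots (K, D) as queries over feats (N, D).

    Returns the updated slots (K, D) and the attention map (N, K).
    """
    d = slots.shape[1]
    logits = feats @ slots.T / np.sqrt(d)      # (N, K) similarity scores
    attn = softmax(logits, axis=1)             # slots compete for each pixel
    # Normalize over pixels so each slot becomes a weighted mean of features.
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    return weights.T @ feats, attn


# Usage: an 8x8 encoded image with 16 channels, grouped into 4 slots.
rng = np.random.default_rng(0)
encoded = rng.normal(size=(8, 8, 16))
queries = primitive_slots(encoded, num_slots=4)
slots, attn = single_pass_slot_attention(queries, encoded.reshape(-1, 16))
print(slots.shape, attn.shape)
```

Because the softmax is taken over the slot axis, each pixel's attention sums to one across slots, which is what makes slots compete for parts of the input. A single pass suffices in this sketch because the queries are already derived from the image rather than sampled at random.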

Simplified priors for Object-Centric Learning
