Systematic Visual Reasoning through Object-Centric Relational Abstraction

Taylor W. Webb, Shanka Subhra Mondal, Jonathan D. Cohen
2023-11-11
Abstract: Human visual reasoning is characterized by an ability to identify abstract patterns from only a small number of examples, and to systematically generalize those patterns to novel inputs. This capacity depends in large part on our ability to represent complex visual inputs in terms of both objects and relations. Recent work in computer vision has introduced models with the capacity to extract object-centric representations, leading to the ability to process multi-object visual inputs, but falling short of the systematic generalization displayed by human reasoning. Other recent models have employed inductive biases for relational abstraction to achieve systematic generalization of learned abstract rules, but have generally assumed the presence of object-focused inputs. Here, we combine these two approaches, introducing Object-Centric Relational Abstraction (OCRA), a model that extracts explicit representations of both objects and abstract relations, and achieves strong systematic generalization in tasks (including a novel dataset, CLEVR-ART, with greater visual complexity) involving complex visual displays.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The core problem this paper addresses is how to give computer vision models human-like systematic generalization: the ability to recognize abstract patterns from only a few examples and to apply those patterns systematically to new inputs. Specifically, the paper aims to develop a relational reasoning model that handles complex multi-object visual inputs while still generalizing learned abstract rules systematically.

### Problem Background

An important feature of human visual reasoning is the ability to recognize abstract patterns from a small number of examples and to generalize them systematically to new inputs. This ability depends on representing complex visual inputs in terms of objects and the relations between them. Existing computer vision models fall short in this regard:

1. **Limitations of Existing Models**:
   - Some models can extract object-centric representations, but they generalize poorly in a systematic sense.
   - Other models achieve strong systematic generalization by building in inductive biases for relational abstraction, but they usually assume that the input has already been segmented into objects, so they cannot directly handle complex multi-object scenes.

### Solutions Proposed in the Paper

To address these problems, the paper proposes the Object-Centric Relational Abstraction (OCRA) model, which combines object-centric representation learning with relational abstraction:

1. **Object-Centric Representation Learning**:
   - Slot attention is used to extract object-centric representations from complex multi-object visual inputs. Each object is represented as a combination of a feature embedding and a position embedding, so the model keeps an object's feature information separate from its position information.
2. **Relational Embedding Computation**:
   - A relational embedding is computed for each pair of objects through the relational operator \(\phi\):
     \[ \phi(z_k, z_{k'}) = (z_k W_z \cdot z_{k'} W_z) W_r \]
     where \(z_k\) and \(z_{k'}\) are the feature embeddings of the two objects, and \(W_z\) and \(W_r\) are linear projection weight matrices. In this way, the model represents the relations between objects, not just their individual features.
3. **High-Order Relationship Processing**:
   - A Transformer processes all pairwise relational embeddings and extracts higher-order relational patterns. This step is crucial for recognizing abstract rules, because such rules are usually defined over multiple relations.
4. **Experimental Verification**:
   - The model is evaluated on several visual reasoning tasks: ART, SVRT, and a newly created dataset, CLEVR-ART. The results show that OCRA generalizes systematically significantly better than the baseline models on these tasks.

### Summary

The main contribution of this paper is a new model, OCRA, that both extracts object-centric representations from complex multi-object visual inputs and achieves strong systematic generalization through relational abstraction. As a result, OCRA performs well on abstract visual reasoning tasks, especially those involving unseen objects and complex scenes.
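The relational operator \(\phi\) described under "Relational Embedding Computation" can be illustrated with a minimal pure-Python sketch. It rests on several assumptions that are not stated in the summary: the product between the two projected embeddings is read as element-wise (a scalar dot product would leave nothing for \(W_r\) to project), relations are computed for every ordered pair of distinct objects, and the helper names, dimensions, and random weights are all hypothetical stand-ins for learned parameters.

```python
import random


def project(v, W):
    """Multiply row vector v (length d_in) by matrix W (d_in x d_out)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


def relation(z_a, z_b, W_z, W_r):
    """phi(z_a, z_b) = (z_a W_z * z_b W_z) W_r.

    Assumption: '*' is an element-wise product of the two projections.
    """
    p_a = project(z_a, W_z)  # both objects share the same projection W_z
    p_b = project(z_b, W_z)
    shared = [x * y for x, y in zip(p_a, p_b)]
    return project(shared, W_r)


def pairwise_relations(slots, W_z, W_r):
    """Relation embeddings for every ordered pair of distinct slots (assumption)."""
    n = len(slots)
    return {
        (k, kp): relation(slots[k], slots[kp], W_z, W_r)
        for k in range(n)
        for kp in range(n)
        if k != kp
    }


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
K, d_z, d_p, d_r = 3, 4, 5, 2          # hypothetical sizes
slots = random_matrix(K, d_z, rng)     # stand-in for K object feature embeddings
W_z = random_matrix(d_z, d_p, rng)
W_r = random_matrix(d_p, d_r, rng)

rels = pairwise_relations(slots, W_z, W_r)
print(len(rels), len(rels[(0, 1)]))    # prints "6 2"
```

Under the element-wise reading, the shared projection makes the operator symmetric, so `rels[(0, 1)]` and `rels[(1, 0)]` coincide. The resulting set of relation embeddings plays the role of the sequence that, according to the summary, is passed to the Transformer to extract higher-order patterns.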