Abstract: Human visual reasoning is characterized by an ability to identify abstract patterns from only a small number of examples, and to systematically generalize those patterns to novel inputs. This capacity depends in large part on our ability to represent complex visual inputs in terms of both objects and relations. Recent work in computer vision has introduced models with the capacity to extract object-centric representations, leading to the ability to process multi-object visual inputs, but falling short of the systematic generalization displayed by human reasoning. Other recent models have employed inductive biases for relational abstraction to achieve systematic generalization of learned abstract rules, but have generally assumed the presence of object-focused inputs. Here, we combine these two approaches, introducing Object-Centric Relational Abstraction (OCRA), a model that extracts explicit representations of both objects and abstract relations, and achieves strong systematic generalization in tasks (including a novel dataset, CLEVR-ART, with greater visual complexity) involving complex visual displays.

What problem does this paper attempt to address?

The core problem this paper addresses is how to give computer vision models human-like systematic generalization: the ability to recognize abstract patterns from only a few examples and to apply those patterns systematically to new inputs. Specifically, the paper aims to develop a relational reasoning algorithm that can handle complex multi-object visual inputs while still generalizing abstract rules systematically.

### Problem Background

An important feature of human visual reasoning is the ability to recognize abstract patterns from a small number of examples and to generalize them systematically to new inputs. This ability depends on representing complex visual inputs in terms of both objects and the relations between them. Existing computer vision models fall short in this regard:

1. **Limitations of Existing Models**:
   - Some models can extract object-centric representations, but they perform poorly at systematic generalization.
   - Other models achieve strong systematic generalization by introducing inductive biases for relational abstraction, but they usually assume that the input consists of already segmented objects and cannot directly handle complex multi-object scenes.

### Solutions Proposed in the Paper

To address these limitations, the paper proposes the Object-Centric Relational Abstraction (OCRA) model, which combines object-centric representation learning with relational abstraction:

1. **Object-Centric Representation Learning**:
   - Use the slot attention mechanism to extract object-centric representations from complex multi-object visual inputs. Each object is represented as a combination of a feature embedding and a position embedding, so the model keeps an object's feature information separate from its position information.
2. **Relational Embedding Computation**:
   - Introduce a new relational embedding method that computes pairwise relationships between objects through the relational operator \(\phi\):
     \[ \phi(z_k, z_{k'}) = (z_k W_z \cdot z_{k'} W_z) W_r \]
     where \(z_k\) and \(z_{k'}\) are the feature embeddings of the two objects, and \(W_z\) and \(W_r\) are linear projection weight matrices. In this way, the model represents the relationships between objects, not just their features.
3. **High-Order Relationship Processing**:
   - Use the Transformer architecture to process all pairwise relationship embeddings and extract higher-order relationship patterns. This step is crucial for recognizing abstract rules, because such rules are usually defined over multiple relationships.
4. **Experimental Verification**:
   - Evaluate the model on multiple visual reasoning tasks, including ART, SVRT, and a newly created dataset, CLEVR-ART. The results show that OCRA exhibits significantly better systematic generalization than the baseline models on these tasks.

### Summary

The main contribution of this paper is a new model, OCRA, which extracts object-centric representations from complex multi-object visual inputs and achieves strong systematic generalization through relational abstraction. As a result, OCRA performs well on abstract visual reasoning tasks, especially when facing unseen objects and complex scenes.
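The relational operator \(\phi\) described under Relational Embedding Computation can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes that the product in the formula is taken elementwise between the two projected embeddings, that relations are computed for every ordered pair of objects (including an object paired with itself), and that the weights are random stand-ins for learned parameters. The function names and the dimensions `d_slot`, `d_proj`, and `d_rel` are hypothetical.

```python
import random


def matvec(z, W):
    """Multiply a row vector z (length d_in) by a matrix W (d_in x d_out)."""
    return [sum(z[i] * W[i][j] for i in range(len(z))) for j in range(len(W[0]))]


def relation(z_a, z_b, W_z, W_r):
    """phi(z_a, z_b) = (z_a W_z * z_b W_z) W_r.

    Assumption: '*' is an elementwise product of the two projections.
    """
    p_a = matvec(z_a, W_z)  # project both objects with the same shared weights
    p_b = matvec(z_b, W_z)
    shared = [x * y for x, y in zip(p_a, p_b)]
    return matvec(shared, W_r)


def pairwise_relations(slots, W_z, W_r):
    """Relation embedding for every ordered pair (k, k') of object embeddings."""
    n = len(slots)
    return {(k, kp): relation(slots[k], slots[kp], W_z, W_r)
            for k in range(n) for kp in range(n)}


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix or a batch of embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_slot, d_proj, d_rel = 4, 3, 2            # hypothetical sizes
W_z = random_matrix(d_slot, d_proj, rng)
W_r = random_matrix(d_proj, d_rel, rng)
slots = random_matrix(3, d_slot, rng)      # three stand-in object feature embeddings
relations = pairwise_relations(slots, W_z, W_r)

# With identity projections the operator reduces to an elementwise product.
identity = [[1, 0], [0, 1]]
print(relation([1, 2], [3, 4], identity, identity))  # prints [3, 8]
```

Because both objects pass through the same projection `W_z` before being multiplied, the result in this sketch is symmetric in its two arguments: `relations[(0, 1)]` equals `relations[(1, 0)]`. In the pipeline summarized above, the full set of pairwise embeddings would then be passed to a Transformer to extract higher-order patterns; that stage is omitted here.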

Systematic Visual Reasoning through Object-Centric Relational Abstraction
