Abstract: Scene understanding and decomposition is a crucial challenge for intelligent systems, whether for object manipulation, navigation, or any other task. Although current machine and deep learning approaches for object detection and classification obtain high accuracy, they typically do not leverage interaction with the world and are limited to the set of objects seen during training. Humans, on the other hand, learn to recognize and classify different objects by actively engaging with them on first encounter. Moreover, recent theories in neuroscience suggest that cortical columns in the neocortex play an important role in this process by building predictive models of objects in their own reference frame. In this article, we present an enactive embodied agent that implements such a generative model for object interaction. For each object category, our system instantiates a deep neural network, called a Cortical Column Network (CCN), that represents the object in its own reference frame by learning a generative model that predicts the expected transform in pixel space, given an action. The model parameters are optimized through the active inference paradigm, i.e., the minimization of variational free energy. When provided with a visual observation, each CCN in the ensemble votes on its belief of observing its specific object category, yielding a potential object classification. If the likelihood of the selected category is too low, the object is detected as belonging to an unknown category, and the agent can instantiate a novel CCN for this category. We validate our system in a simulated environment, where it needs to learn to discern multiple objects from the YCB dataset. We show that classification accuracy improves as the embodied agent gathers more evidence, and that it is able to learn about novel, previously unseen objects. Finally, we show that an agent driven through active inference can choose its actions to reach a preferred observation.
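
The voting and novelty-detection logic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `PrototypeModel` is a hypothetical stand-in for a CCN that scores an observation by its squared distance to a stored prototype, and `Ensemble`, `log_threshold`, and `classify_or_add` are names assumed here for illustration only. The real system learns a generative model per category by minimizing variational free energy, which this sketch does not attempt.

```python
from typing import Dict, List, Optional, Sequence


class PrototypeModel:
    """Hypothetical stand-in for a CCN (assumption, not the paper's network)."""

    def __init__(self, prototype: Sequence[float]) -> None:
        self.prototype = list(prototype)

    def log_likelihood(self, observation: Sequence[float]) -> float:
        # Unnormalized Gaussian log-likelihood: negative squared distance.
        return -sum((o - p) ** 2 for o, p in zip(observation, self.prototype))


class Ensemble:
    """One model per known category; each model votes on every observation."""

    def __init__(self, log_threshold: float) -> None:
        self.log_threshold = log_threshold  # assumed novelty threshold
        self.models: Dict[str, PrototypeModel] = {}

    def classify(self, observations: List[Sequence[float]]) -> Optional[str]:
        """Return the best-voted category, or None if the object looks unknown."""
        if not self.models or not observations:
            return None
        # Average the evidence over all observations gathered so far.
        votes = {
            name: sum(m.log_likelihood(o) for o in observations) / len(observations)
            for name, m in self.models.items()
        }
        best = max(votes, key=votes.get)
        return best if votes[best] >= self.log_threshold else None

    def classify_or_add(self, observations: List[Sequence[float]], new_name: str) -> str:
        """Classify; if the object is unknown, instantiate a new model for it."""
        label = self.classify(observations)
        if label is not None:
            return label
        dim = len(observations[0])
        # Use the mean observation as the new category's prototype.
        prototype = [sum(o[i] for o in observations) / len(observations) for i in range(dim)]
        self.models[new_name] = PrototypeModel(prototype)
        return new_name
```

In this sketch, an observation far from every stored prototype falls below the threshold and triggers the creation of a new category model, mirroring the step in which the agent instantiates a novel CCN.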

Look Further to Recognize Better: Learning Shared Topics and Category-Specific Dictionaries for Open-Ended 3D Object Recognition

Open-Ended Fine-Grained 3D Object Categorization by Combining Shape and Texture Features in Multiple Colorspaces

3D_DEN: Open-ended 3D Object Recognition using Dynamically Expandable Networks

Fine-grained 3D object recognition: an approach and experiments

Lifelong ensemble learning based on multiple representations for few-shot object recognition

Investigating the Importance of Shape Features, Color Constancy, Color Spaces and Similarity Measures in Open-Ended 3D Object Recognition

Vision-Based Categorical Object Pose Estimation and Manipulation

Dictionary Learning for Robotic Grasp Recognition and Detection

Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions

Learning visual object models on a robot using context and appearance cues

Recognizing Objects In-the-wild: Where Do We Stand?

OV-DAR: Open-Vocabulary Object Detection and Attributes Recognition

You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example

Learning Category-Specific Dictionary and Shared Dictionary for Fine-Grained Image Categorization

Structured Spatial Reasoning with Open Vocabulary Object Detectors

Fine-Grained and Layered Object Recognition

Embodied Object Representation Learning and Recognition

Beyond Object Recognition: A New Benchmark towards Object Concept Learning

Unseen Object Reasoning with Shared Appearance Cues

Learning-based Relational Object Matching Across Views