Abstract:Scene understanding and decomposition is a crucial challenge for intelligent systems, whether it is for object manipulation, navigation, or any other task. Although current machine and deep learning approaches for object detection and classification obtain high accuracy, they typically do not leverage interaction with the world and are limited to a set of objects seen during training. Humans on the other hand learn to recognize and classify different objects by actively engaging with them on first encounter. Moreover, recent theories in neuroscience suggest that cortical columns in the neocortex play an important role in this process, by building predictive models about objects in their reference frame. In this article, we present an enactive embodied agent that implements such a generative model for object interaction. For each object category, our system instantiates a deep neural network, called Cortical Column Network (CCN), that represents the object in its own reference frame by learning a generative model that predicts the expected transform in pixel space, given an action. The model parameters are optimized through the active inference paradigm, i.e., the minimization of variational free energy. When provided with a visual observation, an ensemble of CCNs each vote on their belief of observing that specific object category, yielding a potential object classification. In case the likelihood on the selected category is too low, the object is detected as an unknown category, and the agent has the ability to instantiate a novel CCN for this category. We validate our system in an simulated environment, where it needs to learn to discern multiple objects from the YCB dataset. We show that classification accuracy improves as an embodied agent can gather more evidence, and that it is able to learn about novel, previously unseen objects. Finally, we show that an agent driven through active inference can choose their actions to reach a preferred observation.

Neural Multisensory Scene Inference

Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

CMCI: A Robust Multimodal Fusion Method for Spiking Neural Networks

Scalable Spatial Memory for Scene Rendering and Navigation

Multiview Scene Graph

Multi-Modal 3D Scene Graph Updater for Shared and Dynamic Environments

Towards Modeling the Interaction of Spatial-Associative Neural Network Representations for Multisensory Perception

3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation

Semantic Implicit Neural Scene Representations With Semi-Supervised Training

Multi-modal Scene-compliant User Intention Estimation in Navigation

Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments

Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

Multimodal Guidance Network for Missing-Modality Inference in Content Moderation

CoCoNets: Continuous Contrastive 3D Scene Representations

Scene Categorization by Deeply Learning Gaze Behavior in a Semisupervised Context

Embodied Object Representation Learning and Recognition

Learning Intuitive Physics with Multimodal Generative Models

Modeling Spatial Layout for Scene Image Understanding Via a Novel Multiscale Sum-Product Network

General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation

Generalization in Multimodal Language Learning from Simulation