Abstract: Scene understanding and decomposition is a crucial challenge for intelligent systems, whether for object manipulation, navigation, or any other task. Although current machine learning and deep learning approaches to object detection and classification achieve high accuracy, they typically do not leverage interaction with the world and are limited to the set of objects seen during training. Humans, on the other hand, learn to recognize and classify different objects by actively engaging with them on first encounter. Moreover, recent theories in neuroscience suggest that cortical columns in the neocortex play an important role in this process by building predictive models of objects in their own reference frame. In this article, we present an enactive embodied agent that implements such a generative model for object interaction. For each object category, our system instantiates a deep neural network, called a Cortical Column Network (CCN), that represents the object in its own reference frame by learning a generative model that predicts the expected transform in pixel space, given an action. The model parameters are optimized through the active inference paradigm, i.e., the minimization of variational free energy. When provided with a visual observation, each CCN in the ensemble votes on its belief that it is observing its own object category, yielding a potential object classification. If the likelihood of the selected category is too low, the object is detected as belonging to an unknown category, and the agent can instantiate a novel CCN for this category. We validate our system in a simulated environment, where it needs to learn to discern multiple objects from the YCB dataset. We show that classification accuracy improves as the embodied agent gathers more evidence, and that the agent is able to learn about novel, previously unseen objects. Finally, we show that an agent driven by active inference can choose its actions to reach a preferred observation.
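The voting and novelty-detection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the votes are combined or how the "too low" threshold is chosen, so everything here is an assumption made for illustration: each per-category model is assumed to report a log-likelihood-style score, scores from successive observations are assumed to add up, and the names `accumulate`, `classify`, and `threshold` are hypothetical rather than taken from the paper.

```python
import math


def accumulate(running, new_scores):
    """Add the scores of a new observation to the running per-category totals.

    Assumption: scores are log-likelihood-like, so evidence from
    successive observations combines by summation.
    """
    return {k: running.get(k, 0.0) + v for k, v in new_scores.items()}


def classify(scores, threshold):
    """Pick the best-scoring category, or None if no model explains the data.

    scores:    dict mapping category name -> score from that category's model
               (hypothetical interface, one entry per model in the ensemble).
    threshold: hypothetical cut-off on the raw score of the winning category.
    Returns (category or None, normalized belief per category).
    """
    if not scores:
        return None, {}
    # Softmax over the votes gives a normalized belief per category.
    peak = max(scores.values())
    weights = {k: math.exp(v - peak) for k, v in scores.items()}
    total = sum(weights.values())
    belief = {k: w / total for k, w in weights.items()}
    best = max(scores, key=scores.get)
    # Threshold the raw score, not the normalized belief: the softmax always
    # sums to one, so it cannot signal that every model fits poorly.
    if scores[best] < threshold:
        return None, belief  # unknown category: a new model could be created
    return best, belief


if __name__ == "__main__":
    totals = {}
    # Two made-up observations scored by two made-up category models.
    for obs_scores in [{"mug": -1.0, "banana": -2.5}, {"mug": -0.5, "banana": -3.0}]:
        totals = accumulate(totals, obs_scores)
    print(classify(totals, threshold=-4.0))
```

In this sketch a return value of `None` stands for the "unknown category" case, at which point the agent in the abstract would instantiate a new model; how that model is trained is outside the scope of the example.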

Universal embodied intelligence: learning from crowd, recognizing the world, and reinforced with experience

Online Decision MetaMorphFormer: A Casual Transformer-Based Reinforcement Learning Framework of Universal Embodied Intelligence

Efficient multitask learning with an embodied predictive model for door opening and entry with whole-body control

Learning body models: from humans to humanoids

MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments

Is Imitation All You Need? Generalized Decision-Making with Dual-Phase Training

Multi-Task Multi-Agent Shared Layers are Universal Cognition of Multi-Agent Coordination

An Interactive Agent Foundation Model

LLM as A Robotic Brain: Unifying Egocentric Memory and Control

Body Calibration: Automatic Inter-Task Mapping between Multi-Legged Robots with Different Embodiments in Transfer Reinforcement Learning

The Journey/DAO/TAO of Embodied Intelligence: From Large Models to Foundation Intelligence and Parallel Intelligence

An Embodied Generalist Agent in 3D World

Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI

Embodied Object Representation Learning and Recognition

Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement

MA-Dreamer: Coordination and communication through shared imagination

Universal Morphology Control via Contextual Modulation

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

Chat with the Environment: Interactive Multimodal Perception Using Large Language Models

From Machine Learning to Robotics: Challenges and Opportunities for Embodied Intelligence