Abstract: Autonomous agents need large repertoires of skills to act reasonably on new tasks that they have not seen before. However, acquiring these skills using only a stream of high-dimensional, unstructured, and unlabeled observations is a tricky challenge for any autonomous agent. Previous methods have used variational autoencoders to encode a scene into a low-dimensional vector that can be used as a goal for an agent to discover new skills. Nevertheless, in compositional/multi-object environments it is difficult to disentangle all the factors of variation into such a fixed-length representation of the whole scene. We propose to use object-centric representations as a modular and structured observation space, which is learned with a compositional generative world model. We show that the structure in the representations in combination with goal-conditioned attention policies helps the autonomous agent to discover and learn useful skills. These skills can be further combined to address compositional tasks like the manipulation of several different objects.

What problem does this paper attempt to address?

The core problem this paper addresses is: **How can autonomous agents acquire the many skills needed for new tasks when relying only on high-dimensional, unstructured, and unlabeled observations?** Specifically, the authors focus on how agents in multi-object environments can learn useful skills through self-supervised learning and goal-conditioned policies, and then combine these skills to complete complex tasks.

### Problem Background

1. **Challenges of High-Dimensional Observational Data**:
   - Autonomous agents usually need to process high-dimensional (e.g., image) and unstructured data streams from the environment.
   - These data come without labels or reward signals, making it difficult to learn effective skills from them.
2. **Limitations of Existing Methods**:
   - Many previous methods used variational autoencoders (VAEs) to encode a scene into a low-dimensional vector that serves as the agent's goal representation.
   - However, in environments containing multiple objects, such a fixed-length representation struggles to disentangle all the factors of variation, which degrades performance.

### Solution

The authors propose a new method, **SMORL (Self-supervised Multi-Object Reinforcement Learning)**, which combines two key elements:

1. **Object-Based Representation Learning**:
   - Use SCALOR, a compositional generative world model, to learn structured object representations from raw sensory input.
   - SCALOR decomposes the scene into representations of multiple objects, each described by features such as its position and appearance, which sidesteps the binding problem that a single-vector VAE faces in multi-object scenes.
2. **Goal-Conditioned Attention Policy**:
   - Design a goal-conditioned attention mechanism that lets the policy focus on the parts of the object representation relevant to the current goal.
   - In this way, the agent can solve one subtask at a time instead of handling all objects simultaneously, which simplifies learning.

### Main Contributions

1. **Significantly Improved Performance**:
   - In experiments, SMORL outperforms existing methods in multi-object environments, especially on complex tasks such as rearranging multiple objects.
2. **Potential for the Real World**:
   - The framework enables agents to autonomously discover and learn useful skills from image input alone, which opens a path toward future use in real-world robotic applications.
3. **Generalization Ability**:
   - Experiments show that SMORL generalizes to some degree and can perform well with numbers of objects not seen during training.

In conclusion, this paper addresses the problem of autonomous agents learning complex skills in multi-object environments by introducing object-based representation learning and goal-conditioned attention policies.
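The goal-conditioned attention idea described under "Solution" can be illustrated with a minimal sketch. It assumes plain dot-product attention in which the goal vector acts as the query over a variable-length set of object vectors, with no learned projections. The function names (`goal_conditioned_attention`, `policy_input`) and the toy vectors are hypothetical and are not taken from the paper, whose actual architecture may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def goal_conditioned_attention(goal, objects):
    """Attend over a set of object vectors, using the goal as the query.

    goal:    list of d floats (hypothetical goal-object representation)
    objects: list of K lists of d floats (hypothetical per-object latents)
    Returns (weights, attended): K attention weights and a d-dim summary.
    """
    d = len(goal)
    # Scaled dot-product score between the goal and each object.
    scores = [
        sum(g * o for g, o in zip(goal, obj)) / math.sqrt(d)
        for obj in objects
    ]
    weights = softmax(scores)
    # Weighted sum of object vectors: a fixed-size summary of the scene
    # that emphasizes the objects most similar to the goal.
    attended = [
        sum(w * obj[i] for w, obj in zip(weights, objects))
        for i in range(d)
    ]
    return weights, attended


def policy_input(goal, objects):
    """Concatenate the goal with the attended summary (assumed policy input)."""
    _, attended = goal_conditioned_attention(goal, objects)
    return list(goal) + attended


# Toy scene: three objects, the first one matches the goal most closely.
goal = [1.0, 0.0, 0.0, 0.0]
objects = [
    [4.0, 0.0, 0.0, 0.2],
    [0.0, 4.0, 0.0, -0.5],
    [0.0, 0.0, 4.0, 0.9],
]
weights, attended = goal_conditioned_attention(goal, objects)
```

Because the summary has a fixed size regardless of how many objects are present, the same function accepts scenes with more or fewer objects, which is one plausible reason an attention-based policy could generalize across object counts.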

Self-supervised Visual Reinforcement Learning with Object-centric Representations

Linking vision and motion for self-supervised object-centric perception

Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints

Visual Reinforcement Learning with Self-Supervised 3D Representations

Self-Supervised Visual Planning with Temporal Skip Connections

Online Object Representations with Contrastive Learning

Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations

Object-sensitive Deep Reinforcement Learning

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

Unsupervised Reinforcement Learning of Transferable Meta-Skills for Embodied Navigation

A Computational Account Of Self-Supervised Visual Learning From Egocentric Object Play

Learning Explicit Object-Centric Representations with Vision Transformers

Embodied Object Representation Learning and Recognition

A Partially Supervised Reinforcement Learning Framework for Visual Active Search

Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints

Object-Centric Scene Representations Using Active Inference

Graphical Object-Centric Actor-Critic

Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Scaling and Benchmarking Self-Supervised Visual Representation Learning

Unsupervised Video Object Segmentation for Deep Reinforcement Learning
