Abstract: Improving scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually address it by learning task-relevant state representations within visual reinforcement learning. Prior work typically introduces one-step behavioral similarity metrics over elements such as rewards and actions to extract task-relevant state information from observations. However, it often ignores the inherent dynamic relationships among these elements, which are essential for learning accurate representations; this in turn impedes the discrimination of task/behavior information that is similar in the short term but differs over long-term dynamics transitions. To alleviate this problem, we propose DSR, an intrinsic dynamics-driven representation learning method with sequence models for visual reinforcement learning. Concretely, DSR optimizes the parameterized encoder using the state-transition dynamics of the underlying system, which prompts the latent encoding to satisfy the state-transition process so that the state space and the noise space can be distinguished. In the implementation, to further improve the representation ability of DSR when encoding similar tasks, the frequency domain of sequential elements and multi-step prediction are adopted to sequentially model the inherent dynamics. Experimental results show that DSR achieves significant performance improvements on the visual Distracting DMControl tasks, with an average improvement of 78.9% over the backbone baseline. Further results indicate that it also achieves the best performance in real-world autonomous driving applications on the CARLA simulator. Moreover, qualitative analysis validates that our method has a superior ability to learn generalizable scene representations on visual tasks.
The source code is available at https://github.com/DMU-XMU/DSR.
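The abstract names two ingredients, multi-step prediction and a frequency-domain view of sequential elements, without giving their exact form. The following is only a minimal sketch of what such quantities could look like, not the DSR implementation: the linear `toy_transition`, the helper names `rollout`, `multi_step_loss`, and `dft`, and all numeric values are hypothetical and chosen purely for illustration.

```python
import cmath
import math


def dft(sequence):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(sequence)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(sequence))
        for k in range(n)
    ]


def rollout(z0, actions, transition):
    """Roll a latent transition model forward, one step per action."""
    latents = []
    z = z0
    for a in actions:
        z = transition(z, a)
        latents.append(z)
    return latents


def multi_step_loss(predicted, targets):
    """Mean squared error over all predicted steps and latent dimensions."""
    total = 0.0
    count = 0
    for p_step, t_step in zip(predicted, targets):
        for p, t in zip(p_step, t_step):
            total += (p - t) ** 2
            count += 1
    return total / count


def toy_transition(z, a):
    """Hypothetical linear latent dynamics (an assumption, not the paper's model)."""
    return [0.9 * zi + a for zi in z]


# Latent state at time t and the next three (scalar) actions.
z0 = [1.0, -1.0]
actions = [0.1, 0.0, -0.1]
predicted = rollout(z0, actions, toy_transition)

# Stand-in for encoder outputs on the future frames: prediction plus a fixed offset.
targets = [[p + 0.1 for p in step] for step in predicted]
loss = multi_step_loss(predicted, targets)

# Frequency-domain view of a short reward sequence.
rewards = [1.0, 0.0, 1.0, 0.0]
spectrum = dft(rewards)
```

In this toy setting the multi-step loss penalizes disagreement between the rolled-out latents and the encodings of the actual future observations, while the spectrum summarizes a periodic reward pattern that a one-step metric would see only one sample at a time.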

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

InterRep: A Visual Interaction Representation for Robotic Grasping

Multimodal integration learning of robot behavior using deep neural networks

Human-oriented Representation Learning for Robotic Manipulation

EC^2: Emergent Communication for Embodied Control

Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Intrinsic Dynamics-Driven Generalizable Scene Representations for Vision-Oriented Decision-Making Applications

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

Multi-Object Navigation with dynamically learned neural implicit representations

Multimodal Information Bottleneck for Deep Reinforcement Learning with Multiple Sensors

Learning to Act with Affordance-Aware Multimodal Neural SLAM

Selective Visual Representations Improve Convergence and Generalization for Embodied AI

Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

Effective Multimodal Reinforcement Learning with Modality Alignment and Importance Enhancement

DeepInteraction++: Multi-Modality Interaction for Autonomous Driving

Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning

Simple Emergent Action Representations from Multi-Task Policy Training

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Training an Interactive Humanoid Robot Using Multimodal Deep Reinforcement Learning

Seamless Integration and Coordination of Cognitive Skills in Humanoid Robots: A Deep Learning Approach

Multimodal Representation Learning for Place Recognition Using Deep Hebbian Predictive Coding