Abstract: Improving scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually address it by learning task-relevant state representations within visual reinforcement learning. Prior work typically introduces one-step behavioral similarity metrics over elements (e.g., rewards and actions) to extract task-relevant state information from observations, but it often ignores the inherent dynamics relationships among these elements, which are essential for learning accurate representations. This omission impedes the discrimination of task/behavior information that is similar in the short term across long-term dynamics transitions. To alleviate this problem, we propose DSR, an intrinsic dynamics-driven representation learning method with sequence models in visual reinforcement learning. Concretely, DSR optimizes the parameterized encoder through the state-transition dynamics of the underlying system, which prompts the latent encoding to satisfy the state-transition process so that the state space and the noise space can be distinguished. In the implementation, to further improve the representation ability of DSR when encoding similar tasks, the frequency domain of the sequential elements and multi-step prediction are adopted to model the inherent dynamics sequentially. Experimental results show that DSR achieves significant performance improvements on the visual Distracting DMControl control tasks, with an average improvement of 78.9% over the backbone baseline. Further results indicate that it also achieves the best performance in real-world autonomous driving applications on the CARLA simulator. Moreover, qualitative analysis validates that our method has a superior ability to learn generalizable scene representations on visual tasks.
The source code is available at https://github.com/DMU-XMU/DSR.
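The two ingredients named in the abstract, multi-step prediction in latent space and a frequency-domain view of sequential elements, can be illustrated with a minimal sketch. This is not the DSR implementation. The function names (`rollout`, `multi_step_loss`, `frequency_loss`), the toy linear transition, and the choice of reward sequences as the "sequential elements" are all assumptions made for illustration.

```python
import cmath


def dft(seq):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(seq))
        for k in range(n)
    ]


def rollout(z0, actions, step):
    """Multi-step prediction: apply a one-step latent model repeatedly.

    `step(z, a)` is a hypothetical transition model returning the next latent.
    """
    preds, z = [], z0
    for a in actions:
        z = step(z, a)
        preds.append(z)
    return preds


def multi_step_loss(preds, targets):
    """Mean squared error between predicted and target latents over all steps."""
    total, count = 0.0, 0
    for p, t in zip(preds, targets):
        for p_i, t_i in zip(p, t):
            total += (p_i - t_i) ** 2
            count += 1
    return total / count


def frequency_loss(pred_seq, true_seq):
    """Mean squared spectral difference between two equal-length sequences
    (assumed here to be reward sequences)."""
    spec_p, spec_t = dft(pred_seq), dft(true_seq)
    return sum(abs(p - t) ** 2 for p, t in zip(spec_p, spec_t)) / len(spec_p)


# Toy linear latent dynamics, purely illustrative.
toy_step = lambda z, a: [0.5 * z_i + a for z_i in z]
preds = rollout([1.0, 2.0], [1.0, 0.0], toy_step)
```

In an actual agent the targets passed to `multi_step_loss` would be the encoder's outputs for the observed future frames, so that minimizing the loss pushes the encoder toward latents that obey the transition model.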

DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning

Hierarchical Adaptive Value Estimation for Multi-modal Visual Reinforcement Learning

Effective Multimodal Reinforcement Learning with Modality Alignment and Importance Enhancement

DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck

Intrinsic Dynamics-Driven Generalizable Scene Representations for Vision-Oriented Decision-Making Applications

Exploiting Multi-modal Fusion for Robust Face Representation Learning with Missing Modality

Tackling Visual Control via Multi-View Exploration Maximization

Learning Latent Dynamic Robust Representations for World Models

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning

MMSS: Multi-modal Sharable and Specific Feature Learning for RGB-D Object Recognition

Multimodal Deep Representation Learning for Video Classification

A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition

Learning Dual-Fused Modality-Aware Representations for RGBD Tracking

Combining Reconstruction and Contrastive Methods for Multimodal Representations in RL

DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization

Simoun: Synergizing Interactive Motion-appearance Understanding for Vision-based Reinforcement Learning

RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning

Stabilizing Visual Reinforcement Learning Via Asymmetric Interactive Cooperation

Domain Adaptive State Representation Alignment for Reinforcement Learning