Learning Latent Dynamic Robust Representations for World Models

Ruixiang Sun, Hongyu Zang, Xin Li, Riashat Islam
2024-05-30
Abstract: Visual Model-Based Reinforcement Learning (MBRL) promises to encapsulate the agent's knowledge about the underlying dynamics of the environment, enabling learning a world model as a useful planner. However, top MBRL agents such as Dreamer often struggle with visual pixel-based inputs in the presence of exogenous or irrelevant noise in the observation space, due to failure to capture task-specific features while filtering out irrelevant spatio-temporal details. To tackle this problem, we apply a spatio-temporal masking strategy and a bisimulation principle, combined with latent reconstruction, to capture endogenous task-specific aspects of the environment for world models, effectively eliminating non-essential information. Joint training of representations, dynamics, and policy often leads to instabilities. To further address this issue, we develop a Hybrid Recurrent State-Space Model (HRSSM) structure, enhancing state representation robustness for effective policy learning. Our empirical evaluation demonstrates significant performance improvements over existing methods in a range of visually complex control tasks such as Maniskill \cite{gu2023maniskill2} with exogenous distractors from the Matterport environment. Our code is available at https://github.com/bit1029public/HRSSM.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the poor performance of visual model-based reinforcement learning (MBRL) in environments whose observations contain exogenous noise or task-irrelevant information. Existing MBRL agents such as Dreamer degrade on pixel inputs in these settings because they fail to capture task-relevant features while filtering out irrelevant spatial and temporal details. To solve this, the authors combine a spatio-temporal masking strategy, a bisimulation principle, and latent reconstruction, so that the world model captures the endogenous, task-relevant aspects of the environment and discards non-essential information. Because jointly training the representation, dynamics, and policy is often unstable, they also develop a Hybrid Recurrent State-Space Model (HRSSM) structure that makes the state representation more robust and thereby supports effective policy learning. Experiments show significant improvements over existing methods across a range of visually complex control tasks, including Maniskill with exogenous distractors taken from the Matterport environment.
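To make the spatio-temporal masking idea concrete, the following is a minimal sketch of one common way to mask a short clip of frames: the clip is divided into space-time cubes and a random fraction of them is zeroed out. This illustrates the general technique only. The function name `cube_mask`, the cube sizes, and the mask ratio are assumptions chosen for the example; the abstract does not specify the authors' actual masking scheme or hyperparameters.

```python
import numpy as np

def cube_mask(frames, cube_t=2, cube_hw=8, mask_ratio=0.5, rng=None):
    """Zero out random space-time cubes in a (T, H, W, C) frame sequence.

    Hypothetical helper for illustration; not the paper's implementation.
    Returns the masked frames and the boolean keep-mask of shape (T, H, W, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H, W, _ = frames.shape
    # Number of cubes along each axis (ceiling division covers partial cubes).
    nt = -(-T // cube_t)
    nh = -(-H // cube_hw)
    nw = -(-W // cube_hw)
    # One decision per cube: True means the cube is kept, False means masked.
    keep = rng.random((nt, nh, nw)) >= mask_ratio
    # Expand cube-level decisions to pixel resolution, then crop to the input.
    keep = keep.repeat(cube_t, axis=0)
    keep = keep.repeat(cube_hw, axis=1)
    keep = keep.repeat(cube_hw, axis=2)
    keep = keep[:T, :H, :W, None]
    return frames * keep, keep

# Example: mask roughly half of the cubes in a 4-frame, 16x16 RGB clip.
clip = np.ones((4, 16, 16, 3), dtype=np.float32)
masked, keep = cube_mask(clip, rng=np.random.default_rng(0))
```

In a setup like the one the summary describes, an encoder would see the masked clip and be trained so that its latent states match those computed from the unmasked clip (latent reconstruction), with a bisimulation-style objective shaping distances between latent states. Those components are not shown here.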