Abstract: Model-based reinforcement learning (RL) has shown great potential in various control tasks in terms of both sample-efficiency and final performance. However, learning a generalizable dynamics model robust to changes in dynamics remains a challenge since the target transition dynamics follow a multi-modal distribution. In this paper, we present a new model-based RL algorithm, coined trajectory-wise multiple choice learning, that learns a multi-headed dynamics model for dynamics generalization. The main idea is updating the most accurate prediction head to specialize each head in certain environments with similar dynamics, i.e., clustering environments. Moreover, we incorporate context learning, which encodes dynamics-specific information from past experiences into the context latent vector, enabling the model to perform online adaptation to unseen environments. Finally, to utilize the specialized prediction heads more effectively, we propose an adaptive planning method, which selects the most accurate prediction head over a recent experience. Our method exhibits superior zero-shot generalization performance across a variety of control tasks, compared to state-of-the-art RL methods. Source code and videos are available at https://sites.google.com/view/trajectory-mcl.

What problem does this paper attempt to address?

The main problem this paper addresses is how well a learned dynamics model generalizes across environments with different dynamics. Model-based reinforcement learning (MBRL) performs well in terms of sample efficiency and final performance, but keeping its dynamics model robust when the environment dynamics change remains a challenge. The reason is that the target transition dynamics follow a multi-modal distribution, which makes it difficult for a single dynamics model to provide accurate predictions.

### Specific problem description:

1. **Multi-modal distribution**: When the dynamics of the environment change, the dynamics model must be able to handle a multi-modal distribution of transitions. For example, a robot may have different leg configurations (such as some legs being damaged), so the same state and action can lead to several different next-state distributions (see Figure 1b). The dynamics model needs to capture these modes.
2. **Insufficient generalization ability**: Existing MBRL methods perform poorly when facing unseen changes in dynamics. For example, a real-world robot may encounter unexpected terrain or environmental changes, and current dynamics models often fail to provide reliable predictions in such cases.
3. **Poor online adaptability**: To cope with unknown environments, the model needs to adapt online, that is, to quickly adjust its predictions to the new environment. Existing methods have limited capabilities in this regard, especially when the change in dynamics is large.

### Solution:

To solve the above problems, the authors propose a new algorithm, trajectory-wise multiple choice learning (T-MCL). The method captures the multi-modal distribution with a multi-headed dynamics model and updates only the most accurate prediction head, so that each head specializes in environments with similar dynamics. In addition, T-MCL incorporates context learning, enabling the model to use dynamics-specific information from past experiences for online adaptation.

### Main contributions:

- **Multi-headed dynamics model**: Each prediction head specializes in a group of environments with similar dynamics, which improves the generalization ability of the model.
- **Trajectory-wise multiple choice learning**: By updating only the most accurate prediction head, environments with different dynamics are automatically discovered and clustered.
- **Context learning**: Past experiences are encoded into a context latent vector to help the model adapt to unseen environments.
- **Adaptive planning**: For action selection, the planner uses the prediction head that was most accurate over recent experience, which further improves generalization performance.

Experimental results show that T-MCL exhibits superior zero-shot generalization performance on a variety of control tasks, especially in environments with large changes in dynamics.
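The two head-selection steps described above (updating only the most accurate head per trajectory, and picking the most accurate head over recent experience for planning) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `trajectory_errors`, `tmcl_loss`, and `select_head` are hypothetical, squared error stands in for whatever prediction loss the heads are actually trained with, and the context encoder and the planner itself are omitted.

```python
import numpy as np


def trajectory_errors(preds, targets):
    """Per-head, per-segment prediction error.

    preds:   (H, B, M, D) next-state predictions from H heads for
             B trajectory segments of M steps and state dimension D.
    targets: (B, M, D) observed next states.
    Returns an (H, B) array: squared error summed over state dimensions
    and averaged over the M steps of each segment.
    """
    squared = (preds - targets[None]) ** 2
    return squared.sum(axis=-1).mean(axis=-1)


def tmcl_loss(preds, targets):
    """Winner-take-all loss at the trajectory level (illustrative).

    For each segment only the head with the lowest segment-level error
    contributes, so gradient-based training on this loss would update
    only that head for that segment.
    """
    errs = trajectory_errors(preds, targets)      # (H, B)
    best = errs.argmin(axis=0)                    # (B,) winning head per segment
    loss = errs[best, np.arange(errs.shape[1])].mean()
    return loss, best


def select_head(recent_preds, recent_targets):
    """Pick the head with the lowest error over recent transitions.

    recent_preds:   (H, N, D) predictions for the last N transitions.
    recent_targets: (N, D) observed next states.
    """
    errs = ((recent_preds - recent_targets[None]) ** 2).sum(axis=-1).mean(axis=-1)
    return int(errs.argmin())


# Toy data: head 0 is accurate on the first two segments, head 1 on the last two.
rng = np.random.default_rng(0)
targets = rng.normal(size=(4, 5, 3))
offsets_head0 = np.array([0.1, 0.1, 1.0, 1.0])[:, None, None]
offsets_head1 = np.array([1.0, 1.0, 0.1, 0.1])[:, None, None]
preds = np.stack([targets + offsets_head0, targets + offsets_head1])

loss, best = tmcl_loss(preds, targets)
print(best)  # winning head index for each of the four segments
```

Selecting the winner per segment rather than per transition is what makes the assignment trajectory-wise: all steps from one segment vote together, so a head is pushed to specialize in an environment rather than in isolated transitions.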

Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning

Learning Parsimonious Dynamics for Generalization in Reinforcement Learning

Hierarchical Prototypes for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning

Prototypical context-aware dynamics generalization for high-dimensional model-based reinforcement learning

Reward-Consistent Dynamics Models Are Strongly Generalizable for Offline Reinforcement Learning

A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning

Learning Dynamics Models for Model Predictive Agents

Dynamics Generalization via Information Bottleneck in Deep Reinforcement Learning

A Multi-step Loss Function for Robust Learning of the Dynamics in Model-based Reinforcement Learning

Trajectory Optimization for Unknown Constrained Systems using Reinforcement Learning

General Robot Dynamics Learning and Gen2Real

Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models

Trajectory Planning for Autonomous Vehicles Using Hierarchical Reinforcement Learning

Data-efficient Deep Reinforcement Learning for Vehicle Trajectory Control

Learning Guidance Rewards with Trajectory-space Smoothing

Sub-trajectory clustering with deep reinforcement learning

Learning to Walk from Three Minutes of Real-World Data with Semi-structured Dynamics Models

Efficient Reinforcement Learning Through Trajectory Generation

Improving Generalization in Reinforcement Learning Training Regimes for Social Robot Navigation

Trajectory Planning with Deep Reinforcement Learning in High-Level Action Spaces

Single-Trajectory Distributionally Robust Reinforcement Learning