Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning

Younggyo Seo, Kimin Lee, Ignasi Clavera, Thanard Kurutach, Jinwoo Shin, Pieter Abbeel
DOI: https://doi.org/10.48550/arXiv.2010.13303
2020-10-26
Abstract: Model-based reinforcement learning (RL) has shown great potential in various control tasks in terms of both sample-efficiency and final performance. However, learning a generalizable dynamics model robust to changes in dynamics remains a challenge since the target transition dynamics follow a multi-modal distribution. In this paper, we present a new model-based RL algorithm, coined trajectory-wise multiple choice learning, that learns a multi-headed dynamics model for dynamics generalization. The main idea is updating the most accurate prediction head to specialize each head in certain environments with similar dynamics, i.e., clustering environments. Moreover, we incorporate context learning, which encodes dynamics-specific information from past experiences into the context latent vector, enabling the model to perform online adaptation to unseen environments. Finally, to utilize the specialized prediction heads more effectively, we propose an adaptive planning method, which selects the most accurate prediction head over a recent experience. Our method exhibits superior zero-shot generalization performance across a variety of control tasks, compared to state-of-the-art RL methods. Source code and videos are available at https://sites.google.com/view/trajectory-mcl.
Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to make a learned dynamics model generalize across environments with different dynamics. Although model-based reinforcement learning (MBRL) performs well in terms of sample efficiency and final performance, its dynamics model is not robust when the environment dynamics change. The reason is that the target transition dynamics follow a multi-modal distribution, which makes it difficult for a single dynamics model to give accurate predictions.

### Specific problem description:

1. **Multi-modal distribution**: When the dynamics of the environment change, the dynamics model must handle a multi-modal distribution. For example, a robot may have different leg configurations (such as some legs being damaged), so the same action can lead to several distinct next-state distributions (see Figure 1b). The dynamics model needs to capture each of these modes.
2. **Insufficient generalization ability**: Existing MBRL methods perform poorly under unseen changes in dynamics. For example, a real-world robot may encounter unexpected terrain or other environmental changes, and current dynamics models often fail to give reliable predictions in such cases.
3. **Poor online adaptability**: To cope with unknown environments, the model must adapt online, that is, quickly adjust its predictions to the new environment. Existing methods are limited in this regard, especially when the change in dynamics is large.

### Solution:

To solve the above problems, the authors propose trajectory-wise multiple choice learning (T-MCL). The method captures the multi-modal distribution with a multi-headed dynamics model and updates only the most accurate prediction head, so that each head specializes in environments with similar dynamics.
In addition, T-MCL incorporates context learning, which lets the model use dynamics-specific information from past experiences for online adaptation.

### Main contributions:

- **Multi-headed dynamics model**: Each prediction head specializes in a group of environments with similar dynamics, which improves the generalization ability of the model.
- **Trajectory-wise multiple choice learning**: Only the most accurate prediction head is updated, so environments with similar dynamics are discovered and clustered automatically.
- **Context learning**: Past experiences are encoded into a context latent vector that helps the model adapt to unseen environments.
- **Adaptive planning**: The prediction head that is most accurate over recent experience is used for action selection, which further improves generalization.

Experimental results show that T-MCL exhibits superior zero-shot generalization performance on a variety of control tasks, especially in environments with large changes in dynamics.
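The two head-selection rules summarized above (update only the most accurate head during training, and plan with the head that best explains recent experience) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the nested-list data layout, and the use of summed squared error as the accuracy measure are all assumptions made for illustration, and the context encoder and the gradient update itself are left out.

```python
from typing import List, Sequence, Tuple

State = Sequence[float]      # one state vector
Segment = Sequence[State]    # consecutive states from one trajectory


def segment_error(predicted: Segment, observed: Segment) -> float:
    """Summed squared error of one head over one trajectory segment.

    Squared error is an illustrative stand-in for the prediction loss.
    """
    return sum(
        (p - o) ** 2
        for pred_state, obs_state in zip(predicted, observed)
        for p, o in zip(pred_state, obs_state)
    )


def tmcl_objective(
    per_head_predictions: Sequence[Sequence[Segment]],
    observed_segments: Sequence[Segment],
) -> Tuple[float, List[int]]:
    """Trajectory-wise winner-takes-all objective (hypothetical helper).

    per_head_predictions[h][b] is head h's prediction for segment b.
    For each segment, only the head with the lowest error over the whole
    segment contributes to the loss, so only that head would be updated.
    """
    winners: List[int] = []
    total = 0.0
    for b, observed in enumerate(observed_segments):
        errors = [segment_error(head[b], observed) for head in per_head_predictions]
        best = min(range(len(errors)), key=errors.__getitem__)
        winners.append(best)
        total += errors[best]
    return total / len(observed_segments), winners


def select_planning_head(
    per_head_recent_predictions: Sequence[Segment],
    recent_observed: Segment,
) -> int:
    """Adaptive planning: pick the head most accurate on recent transitions."""
    errors = [segment_error(pred, recent_observed) for pred in per_head_recent_predictions]
    return min(range(len(errors)), key=errors.__getitem__)


# Toy usage: two environments with 1-D states, and two heads that each
# match one of them exactly.
slow = [[0.0], [0.0], [0.0]]
fast = [[1.0], [1.0], [1.0]]
loss, winners = tmcl_objective([[slow, slow], [fast, fast]], [slow, fast])
planning_head = select_planning_head([slow, fast], fast)
```

Selecting the winner per trajectory segment rather than per transition is what pushes each head to specialize in one group of environments: every transition in a segment comes from the same environment, so the whole segment is assigned to a single head.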