Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning

Yao Mu, Yuzheng Zhuang, Fei Ni, Bin Wang, Jianyu Chen, Jianye Hao, Ping Luo
DOI: https://doi.org/10.48550/arXiv.2210.04209
2022-10-09
Abstract: Adapting to the changes in transition dynamics is essential in robotic applications. By learning a conditional policy with a compact context, context-aware meta-reinforcement learning provides a flexible way to adjust behavior according to dynamics changes. However, in real-world applications, the agent may encounter complex dynamics changes. Multiple confounders can influence the transition dynamics, making it challenging to infer accurate context for decision-making. This paper addresses such a challenge by Decomposed Mutual INformation Optimization (DOMINO) for context learning, which explicitly learns a disentangled context to maximize the mutual information between the context and historical trajectories, while minimizing the state transition prediction error. Our theoretical analysis shows that DOMINO can overcome the underestimation of the mutual information caused by multi-confounded challenges via learning disentangled context and reduce the demand for the number of samples collected in various environments. Extensive experiments show that the context learned by DOMINO benefits both model-based and model-free reinforcement learning algorithms for dynamics generalization in terms of sample efficiency and performance in unseen environments.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to extract and use context information in meta-reinforcement learning (Meta-RL) when changes in environment dynamics are driven by several factors at once (i.e., multiple confounders). It proposes DOMINO (DecOmposed Mutual INformation Optimization), which learns disentangled context representations through decomposed mutual information optimization, improving generalization and sample efficiency in unseen environments with multiple confounders.

### Background of the Paper and Problem Definition

In robotic applications, the ability to adapt to changes in environment dynamics is crucial. Context-aware meta-reinforcement learning offers a flexible way to adjust behavior as dynamics change, by learning a policy conditioned on a compact context. In practice, however, agents may encounter complex dynamics changes in which multiple confounders affect the state-transition dynamics, making it hard to infer an accurate context for decision-making. This leads to the following challenges:

- **Complexity Caused by Multiple Confounders**: Several factors (such as mass, length, and damping) affect the environment dynamics simultaneously, which makes context inference harder.
- **Sample Efficiency**: In multi-confounder environments, a large number of samples is required to capture the dynamics accurately, which limits the generalization ability of the model.

### The DOMINO Method

To address these problems, DOMINO introduces the following:

1. **Disentangled Context Learning**: DOMINO explicitly learns disentangled context representations by maximizing the mutual information (MI) between the context and the historical trajectory while minimizing the state-transition prediction error. This better captures the influence of the individual confounders and improves the accuracy of the context.
2. **Decomposed Mutual Information Optimization**: DOMINO decomposes the full mutual information optimization problem into several smaller sub-problems, and by learning disentangled contexts it reduces the number of samples required. Theoretical analysis shows that this decomposition alleviates the underestimation of mutual information by InfoNCE (a commonly used lower-bound estimator of mutual information) in multi-confounder environments.

### Experimental Results

The paper evaluates DOMINO through extensive experiments:

- **Performance of Model-Based Methods**: In unseen multi-confounder environments, DOMINO outperforms existing methods (such as T-MCL) in both generalization performance and sample efficiency. For example, in the Cripple-Ant environment, the performance of DOMINO is approximately 2.6 times higher than that of T-MCL.
- **Performance of Model-Free Methods**: The context representation learned by DOMINO can also serve as a plug-in module that significantly improves the generalization of model-free methods (such as PPO) in multi-confounder environments. Experimental results show that PPO + DOMINO performs well across a variety of complex environments, especially in tasks such as HalfCheetah, Ant, and Hopper.

### Conclusion

DOMINO addresses the challenge of context inference in multi-confounder environments and improves generalization and sample efficiency through disentangled context learning and decomposed mutual information optimization. The method applies to model-based meta-reinforcement learning as well as to model-free reinforcement learning methods, which suggests broad application potential.
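
As a rough illustration of the decomposed mutual information idea summarized above, the sketch below computes an InfoNCE estimate and then sums one such term per context component. It is not the authors' implementation. The function names (`info_nce`, `decomposed_info_nce`), the cosine-similarity critic, the temperature, and the toy data are all assumptions made for this example, and the sketch leaves out the state-transition prediction loss and any penalty that keeps the components disentangled. The point it shows is a known property of InfoNCE: a single term computed from K candidates cannot exceed log K, whereas a sum of N per-component terms can reach N log K with the same K.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(contexts, trajectories, temperature=0.1):
    """InfoNCE estimate (in nats) from K paired embeddings.

    Row i of `contexts` and row i of `trajectories` form the positive pair;
    every other row of `trajectories` acts as a negative for context i.
    The critic (cosine similarity / temperature) is an assumption of this sketch.
    """
    k = len(contexts)
    c = [normalize(v) for v in contexts]
    t = [normalize(v) for v in trajectories]
    total = 0.0
    for i in range(k):
        logits = [dot(c[i], t[j]) / temperature for j in range(k)]
        # Numerically stable log-sum-exp over the K candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += logits[i] - log_norm  # log-probability of the positive, <= 0
    return math.log(k) + total / k  # therefore never larger than log K


def decomposed_info_nce(context_parts, trajectory_parts, temperature=0.1):
    """Sum of one InfoNCE term per (hypothetical) context component."""
    return sum(
        info_nce(c, t, temperature)
        for c, t in zip(context_parts, trajectory_parts)
    )


# Toy data: N components, each with K positive pairs of D-dimensional embeddings.
random.seed(0)
K, D, N = 64, 32, 3


def make_pairs():
    traj = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
    ctx = [[x + random.gauss(0, 0.01) for x in row] for row in traj]
    return ctx, traj


parts = [make_pairs() for _ in range(N)]
single = info_nce(*parts[0])
total = decomposed_info_nce([p[0] for p in parts], [p[1] for p in parts])
print(f"log K = {math.log(K):.3f}, single = {single:.3f}, decomposed = {total:.3f}")
```

In this toy setup each component gets its own trajectory embedding, which is a simplification. In a real encoder the components would be produced from the same history trajectory, and the objective would be maximized by gradient ascent rather than evaluated on fixed vectors.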