Zero-shot Policy Learning with Spatial Temporal Reward Decomposition on Contingency-aware Observation

Huazhe Xu, Boyuan Chen, Yang Gao, Trevor Darrell
DOI: https://doi.org/10.48550/arXiv.1910.08143
2021-03-15
Abstract: It is a long-standing challenge to enable an intelligent agent to learn in one environment and generalize to an unseen environment without further data collection and finetuning. In this paper, we consider a zero-shot generalization problem setup that complies with biological intelligent agents' learning and generalization processes. The agent is first presented with previous experiences in the training environment, along with a task description in the form of trajectory-level sparse rewards. Later, when it is placed in the new testing environment, it is asked to perform the task without any interaction with the testing environment. We find this setting natural for biological creatures and, at the same time, challenging for previous methods. Behavior cloning, state-of-the-art RL, and other zero-shot learning methods perform poorly on this benchmark. Given a set of experiences in the training environment, our method learns a neural function that decomposes the sparse reward into particular regions in a contingency-aware observation as a per-step reward. Based on such decomposed rewards, we further learn a dynamics model and use Model Predictive Control (MPC) to obtain a policy. Since the rewards are decomposed to finer-granularity observations, they are naturally generalizable to new environments that are composed of similar basic elements. We demonstrate our method on a wide range of environments, including a classic video game -- Super Mario Bros, as well as a robotic continuous control task. Please refer to the project page for more visualized results.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is zero-shot generalization to new environments. Specifically, it asks how an agent can learn in a training environment and then perform the task in an unseen environment without further data collection or fine-tuning. This setting matches the learning and generalization processes of biological agents, but it is challenging for existing methods: behavior cloning, state-of-the-art reinforcement learning, and other zero-shot learning methods all perform poorly on this benchmark.

The paper proposes a new method. From a set of experiences in the training environment, it learns a neural function that decomposes the sparse reward into per-step rewards assigned to specific regions of a contingency-aware observation. Based on these decomposed rewards, a dynamics model is learned, and model predictive control (MPC) is used to obtain a policy. Because the rewards are attached to finer-grained observations, they generalize naturally to new environments composed of similar basic elements.

### Specific Problem Description

1. **Challenges of Zero-Shot Generalization**:
   - The agent must learn in the training environment and then perform the task in an unseen environment without any additional interaction or fine-tuning.
   - This setting matches how biological agents learn and generalize, but existing methods perform poorly on it.
2. **Decomposition of Sparse Rewards**:
   - The paper proposes a method that learns from trajectory data in the training environment to decompose sparse terminal rewards into rewards at specific time steps and spatial locations.
   - With decomposed rewards, the agent can better attribute outcomes to its actions, which helps it generalize to new environments.
3. **Dynamics Model and Planning**:
   - Based on the decomposed rewards, a dynamics model is learned, and model predictive control (MPC) is used to generate a policy.
   - The dynamics model and the decomposed rewards work together, enabling the agent to perform the task in new environments.

### Solutions

1. **Spatio-Temporal Reward Decomposition**:
   - **Temporal Reward Decomposition**: A neural network decomposes the trajectory-level reward across time steps to obtain per-step rewards.
   - **Spatial Reward Decomposition**: The observation is divided into multiple sub-regions, and the per-step reward is further decomposed across them to capture the importance of each region.
2. **Dynamics Model Learning**:
   - A forward dynamics model is learned from exploration data in a self-supervised manner. The model predicts future states from the current state and action.
3. **Model Predictive Control (MPC)**:
   - The learned scoring function and the dynamics model are used by the MPC algorithm to search for the best sequence of actions for the task in the new environment.

### Experimental Verification

The paper reports experiments in two challenging environments:

1. **Super Mario Bros**:
   - The proposed method is compared with multiple baseline methods in both the training environment and the test environment.
   - In the zero-shot generalization setting in particular, the proposed method significantly outperforms the other methods.
2. **3D Robot Tasks**:
   - In a 3D robot environment, the generalization ability of the proposed method is evaluated under different obstacle configurations.
   - The results show that the proposed method performs well in both the training environment and the test environment, including in this 3D setting.

### Main Contributions

1. **Spatio-Temporal Decomposition of Sparse Rewards**:
   - Decomposing sparse rewards over finer-grained temporal and spatial observations improves zero-shot generalization.
2. **Novel Method**:
   - The paper proposes a method that combines the learned decomposition function with a neural dynamics model and performs well on a variety of challenging tasks.

In summary, by introducing spatio-temporal reward decomposition and dynamics model learning, the paper addresses zero-shot generalization to new environments and supports the approach with experiments in two domains.
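
The spatio-temporal decomposition described above can be written compactly. The notation here is an assumption made for illustration and is not taken from this page: $\tau$ is a trajectory of length $T$, $R(\tau)$ its sparse trajectory-level reward, $W_l(s_t)$ the $l$-th of $L$ local regions of the observation at step $t$, and $S_\theta$ the learned scoring function. One plausible form is a sum over time steps and regions, fitted by regression to the trajectory-level reward:

```latex
J_\theta(\tau) \;=\; \sum_{t=1}^{T} \sum_{l=1}^{L} S_\theta\bigl(W_l(s_t),\, a_t\bigr),
\qquad
\min_\theta \; \tfrac{1}{2}\,\bigl(J_\theta(\tau) - R(\tau)\bigr)^{2}
```

The outer sum corresponds to the temporal decomposition and the inner sum to the spatial decomposition. Whether the score also depends on the action $a_t$, as written here, is part of the assumption.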
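
The three components summarized above (decomposed local rewards, a dynamics model, and MPC) can be sketched on a toy one-dimensional grid. Everything in this sketch is an illustrative assumption rather than the paper's implementation: the grid world, the names `LOCAL_SCORE`, `local_window`, `step_reward`, `dynamics`, and `plan`, the hand-written score table that stands in for the learned neural scoring function, the hand-written transition rule that stands in for the learned dynamics model, and the exhaustive enumeration of action sequences used here as the search inside MPC.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical action set: move left, stay, move right

# Stand-in for the learned scoring function (a neural network in the paper):
# the score contributed by each kind of object seen near the agent.
LOCAL_SCORE = {"coin": 1.0, "pit": -5.0, "empty": 0.0}


def local_window(world, pos, radius=1):
    """Agent-centred observation: cells within `radius` of the agent."""
    return [world[i] if 0 <= i < len(world) else "empty"
            for i in range(pos - radius, pos + radius + 1)]


def step_reward(world, pos):
    """Spatial decomposition: sum the scores of the cells in the local window."""
    return sum(LOCAL_SCORE[cell] for cell in local_window(world, pos))


def dynamics(pos, action, size):
    """Stand-in for the learned dynamics model: predict the next position."""
    return min(max(pos + action, 0), size - 1)


def plan(world, pos, horizon=3):
    """MPC step: score every action sequence, return (first action, best score)."""
    best_seq, best_score = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        p, total = pos, 0.0
        for action in seq:
            p = dynamics(p, action, len(world))
            total += step_reward(world, p)  # temporal decomposition: sum over steps
        if total > best_score:
            best_seq, best_score = seq, total
    return best_seq[0], best_score


# Two layouts built from the same basic elements, arranged differently.
layout_a = ["empty", "pit", "empty", "empty", "coin"]
layout_b = ["coin", "empty", "empty", "pit", "empty"]
print(plan(layout_a, pos=2))  # (1, 3.0): move right, away from the pit
print(plan(layout_b, pos=2))  # (-1, 3.0): move left, away from the pit
```

Because the scores are attached to what appears in the local window rather than to absolute positions, the same table is reused on the second layout without any change, which is the intuition behind the generalization claim.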