Integrating Contrastive Learning with Dynamic Models for Reinforcement Learning from Images

Bang You, Oleg Arenz, Youping Chen, Jan Peters
DOI: https://doi.org/10.1016/j.neucom.2021.12.094
2022-03-02
Abstract: Recent methods for reinforcement learning from images use auxiliary tasks to learn image features that are used by the agent's policy or Q-function. In particular, methods based on contrastive learning that induce linearity of the latent dynamics or invariance to data augmentation have been shown to greatly improve the sample efficiency of the reinforcement learning algorithm and the generalizability of the learned embedding. We further argue that explicitly improving Markovianity of the learned embedding is desirable and propose a self-supervised representation learning method which integrates contrastive learning with dynamic models to synergistically combine these three objectives: (1) we maximize the InfoNCE bound on the mutual information between the state- and action-embedding and the embedding of the next state to induce a linearly predictive embedding without explicitly learning a linear transition model, (2) we further improve Markovianity of the learned embedding by explicitly learning a non-linear transition model using regression, and (3) we maximize the mutual information between the two nonlinear predictions of the next embeddings based on the current action and two independent augmentations of the current state, which naturally induces transformation invariance not only for the state embedding, but also for the nonlinear transition model. Experimental evaluation on the DeepMind Control Suite shows that our proposed method achieves higher sample efficiency and better generalization than state-of-the-art methods based on contrastive learning or reconstruction.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to improve sample efficiency and generalization in image-based reinforcement learning. It proposes a self-supervised representation learning method that combines contrastive learning with a dynamic model. The method targets three problems in current approaches:

1. **High-dimensional observation space**: Image-based reinforcement learning has to handle high-dimensional observations, so learning an effective policy requires a very large number of environment interactions, which is unrealistic for many practical robotic tasks.
2. **Non-Markovian state representation**: Existing methods often learn state representations that do not preserve the Markov property, that is, the property that the prediction of the next state depends only on the current state and action, not on earlier states and actions. With a non-Markovian representation, the learned policy and value function cannot fully exploit the information available in the environment.
3. **Insufficient data-augmentation invariance**: Existing methods use data augmentation to make the model more robust, but they usually enforce invariance only at the image level and ignore invariance of the underlying dynamic model.

To address these problems, the paper proposes three auxiliary tasks:

1. **Maximizing temporal mutual information**: Maximize the mutual information between the current state-action embedding and the next-state embedding, which induces a linearly predictive embedding without explicitly learning a linear transition model.
2. **Improving latent Markovianity**: Explicitly learn a nonlinear transition model by regression to further improve the Markov property of the learned embedding.
3. **Multi-view mutual information maximization**: Maximize the mutual information between the two nonlinear predictions of the next embedding, each computed from the current action and one of two independent augmentations of the current state. This naturally makes both the state embedding and the nonlinear transition model invariant to data augmentation.

In the reported experiments, the proposed framework achieves higher sample efficiency and better generalization than existing methods based on contrastive learning or reconstruction.
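
The three auxiliary tasks can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the bilinear score matrices (`W_temporal`, `W_view`), the concatenation of state and action embeddings, the one-hidden-layer `transition` function, and the use of mean squared error for the regression term are all assumptions made here for illustration, and the image encoder is omitted (embeddings are passed in directly).

```python
import numpy as np

def info_nce(queries, keys, W):
    """InfoNCE loss: row i of `queries` is the positive for row i of `keys`;
    all other rows in the batch act as negatives. `W` is a hypothetical
    bilinear score matrix (an assumption, not taken from the paper)."""
    logits = queries @ W @ keys.T                        # (B, B) scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def transition(z, a, params):
    """Hypothetical nonlinear transition model: one tanh hidden layer."""
    W1, W2 = params
    hidden = np.tanh(np.concatenate([z, a], axis=1) @ W1)
    return hidden @ W2

def auxiliary_losses(z1, z2, a, z_next, params, W_temporal, W_view):
    """z1, z2: embeddings of two augmentations of the current state;
    a: action embedding; z_next: embedding of the next state."""
    # (1) temporal mutual information: state-action vs. next-state embedding
    state_action = np.concatenate([z1, a], axis=1)
    loss_temporal = info_nce(state_action, z_next, W_temporal)
    # (2) latent Markovianity: regress the next embedding (MSE is an assumption)
    pred1 = transition(z1, a, params)
    pred2 = transition(z2, a, params)
    loss_regression = np.mean((pred1 - z_next) ** 2)
    # (3) multi-view mutual information between the two predictions
    loss_view = info_nce(pred1, pred2, W_view)
    return loss_temporal, loss_regression, loss_view

# Usage with random placeholder data (batch 8, state dim 16, action dim 4).
rng = np.random.default_rng(0)
B, d, da, h = 8, 16, 4, 32
z1, z2, z_next = (rng.normal(size=(B, d)) for _ in range(3))
a = rng.normal(size=(B, da))
params = (0.1 * rng.normal(size=(d + da, h)), 0.1 * rng.normal(size=(h, d)))
W_temporal = 0.1 * rng.normal(size=(d + da, d))
W_view = np.eye(d)
losses = auxiliary_losses(z1, z2, a, z_next, params, W_temporal, W_view)
print(losses)
```

In a full training loop, the three terms would presumably be combined into one weighted auxiliary objective and minimized jointly with the reinforcement learning loss; the weighting is not specified in this summary.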