TACO: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning

Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daumé III, Furong Huang
2024-05-24
Abstract: Despite recent progress in reinforcement learning (RL) from raw pixel data, sample inefficiency continues to present a substantial obstacle. Prior works have attempted to address this challenge by creating self-supervised auxiliary tasks, aiming to enrich the agent's learned representations with control-relevant information for future state prediction. However, these objectives are often insufficient to learn representations that can represent the optimal policy or value function, and they often consider tasks with small, abstract discrete action spaces and thus overlook the importance of action representation learning in continuous control. In this paper, we introduce TACO: Temporal Action-driven Contrastive Learning, a simple yet powerful temporal contrastive learning approach that facilitates the concurrent acquisition of latent state and action representations for agents. TACO simultaneously learns a state and an action representation by optimizing the mutual information between representations of current states paired with action sequences and representations of the corresponding future states. Theoretically, TACO can be shown to learn state and action representations that encompass sufficient information for control, thereby improving sample efficiency. For online RL, TACO achieves a 40% performance boost after one million environment interaction steps on average across nine challenging visual continuous control tasks from the DeepMind Control Suite. In addition, we show that TACO can also serve as a plug-and-play module added to existing offline visual RL methods to establish new state-of-the-art performance for offline visual RL across offline datasets of varying quality.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the low sample efficiency of visual reinforcement learning (RL) when complex continuous-control tasks must be learned from high-dimensional pixel data. Despite recent progress in RL from raw pixels, sample efficiency remains a major obstacle. Existing methods enrich the agent's learned representations with self-supervised auxiliary tasks that predict future states, but these objectives are often insufficient to learn representations that can express the optimal policy or value function. They also tend to consider only tasks with small, abstract discrete action spaces, and so overlook the importance of action representation learning in continuous control.

To this end, the paper introduces **TACO (Temporal Action-driven COntrastive Learning)**, a simple yet powerful temporal contrastive learning method that learns latent state and action representations at the same time. TACO does this by optimizing the mutual information between the representations of current states paired with action sequences and the representations of the corresponding future states. Theoretically, TACO can be shown to learn state and action representations that contain sufficient information for control, which improves sample efficiency.

In online RL, TACO improves average performance by 40% after one million environment interaction steps across nine challenging visual continuous-control tasks in the DeepMind Control Suite. TACO can also be added as a plug-and-play module to existing offline visual RL methods, where it achieves state-of-the-art performance across offline datasets of varying quality.

### Specific Contributions

1. **Propose TACO**: A simple and effective temporal contrastive learning framework that simultaneously learns state and action representations.
2. **Flexible Framework**: TACO can be integrated into online and offline visual RL algorithms with almost no changes to the architecture or hyperparameters.
3. **Theoretical Guarantee**: The paper shows theoretically that the TACO objective is sufficient for the state and action representations to capture the information needed for control.
4. **Empirical Effect**: Experiments show that TACO significantly outperforms existing state-of-the-art model-free visual RL algorithms on nine challenging tasks in the DeepMind Control Suite, with a 1.4-fold average performance improvement (the 40% gain noted above). In offline RL, TACO also significantly improves performance.

### Method Overview

The core of TACO is to learn state and action representations by maximizing the mutual information between the representation of the current state paired with an action sequence and the representation of the future state. Specifically, given state \( S_t \) and action \( A_t \), their latent representations are \( Z_t = \phi(S_t) \) and \( U_t = \psi(A_t) \) respectively. TACO aims to maximize the following mutual information:

\[ J_{\text{TACO}} = I(Z_{t+K}; [Z_t, U_t, \ldots, U_{t+K-1}]) \]

where \( K \geq 1 \) is a fixed hyperparameter that sets the prediction horizon. In practice, the InfoNCE loss is used to estimate a lower bound on this mutual information.

### Experimental Results

- **Online Reinforcement Learning**: TACO significantly outperforms other model-free and model-based visual RL algorithms on nine tasks in the DeepMind Control Suite, especially on complex tasks such as Quadruped Run and Reacher Hard.
- **Offline Reinforcement Learning**: TACO can be combined as a plug-and-play module with existing offline RL methods (such as TD3+BC and CQL) to further improve performance across offline datasets of varying quality.

### Conclusion

By learning state and action representations simultaneously, TACO significantly improves the sample efficiency and performance of visual RL in both online and offline settings. The method is supported by theoretical analysis and demonstrates its effectiveness on multiple benchmark tasks.
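To make the InfoNCE step from the Method Overview concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The helper names (`dot`, `linear`, `make_query`, `info_nce_loss`, `mi_lower_bound`), the dot-product similarity, the linear projection, and the use of the other batch elements as negatives are all choices made for this example. The query stands for the pair \( [Z_t, U_t, \ldots, U_{t+K-1}] \) and the key stands for \( Z_{t+K} \).

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear(x, weight):
    """Apply a linear map given as a list of output rows (hypothetical projection)."""
    return [dot(row, x) for row in weight]


def make_query(z_t, action_latents):
    """Concatenate the state latent Z_t with the K action latents U_t..U_{t+K-1}."""
    return list(z_t) + [x for u in action_latents for x in u]


def info_nce_loss(queries, keys):
    """Mean InfoNCE loss over a batch.

    keys[i] is the positive for queries[i]; every other key in the
    batch is treated as a negative (an assumption of this sketch).
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]  # -log softmax at the positive index
    return total / len(queries)


def mi_lower_bound(loss, batch_size):
    """Standard InfoNCE bound: I >= log(N) - loss."""
    return math.log(batch_size) - loss


if __name__ == "__main__":
    # Toy batch: 2 samples, state latent of size 2, K = 2 actions of size 1.
    z_t = [[1.0, 0.0], [0.0, 1.0]]
    actions = [[[0.5], [0.5]], [[-0.5], [-0.5]]]
    z_future = [[1.0, 0.0], [0.0, 1.0]]
    # Hypothetical projection from the 4-dim concatenation to the 2-dim key space.
    proj = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    queries = [linear(make_query(z, a), proj) for z, a in zip(z_t, actions)]
    loss = info_nce_loss(queries, z_future)
    print(loss, mi_lower_bound(loss, len(queries)))
```

Minimizing this loss tightens the lower bound on the mutual information, which is at most \( \log N \) for a batch of size \( N \). In a real agent the encoders \( \phi \) and \( \psi \) and the projection would be trained jointly with the RL objective; the fixed numbers here only exercise the computation.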