Abstract: We study offline meta-reinforcement learning, a practical reinforcement learning paradigm that learns from offline data to adapt to new tasks. The distribution of offline data is determined jointly by the behavior policy and the task. Existing offline meta-reinforcement learning algorithms cannot distinguish these factors, making task representations unstable to the change of behavior policies. To address this problem, we propose a contrastive learning framework for task representations that are robust to the distribution mismatch of behavior policies in training and test. We design a bi-level encoder structure, use mutual information maximization to formalize task representation learning, derive a contrastive learning objective, and introduce several approaches to approximate the true distribution of negative pairs. Experiments on a variety of offline meta-reinforcement learning benchmarks demonstrate the advantages of our method over prior methods, especially on the generalization to out-of-distribution behavior policies. The code is available at https://github.com/PKU-AI-Edge/CORRO.

What problem does this paper attempt to address?

The problem this paper addresses is how, in offline meta-reinforcement learning (OMRL), to learn task representations from offline datasets that stay robust when the behavior policy distribution changes. Specifically, when the behavior policies that produced the training data differ from those that produced the test data, the task representations learned by existing methods are easily disturbed, which leads to poor generalization. To solve this problem, the authors propose CORRO (COntrastive Robust task Representation learning for OMRL), a contrastive learning framework that aims to improve the robustness and generalization of task representations by maximizing the mutual information between task representations and tasks while minimizing the influence of behavior policies.

### Main Contributions

1. **Propose a new framework**: The framework learns robust task representations from fully offline datasets. These representations can distinguish different tasks within transition distributions that are determined jointly by behavior policies and tasks.
2. **Design a contrastive learning objective**: Maximizing the mutual information between task representations and tasks extracts the features shared by transitions of the same task and captures the essential differences in reward functions and transition dynamics between tasks.
3. **Experimental verification**: On multiple benchmarks, the method outperforms previous methods in generalizing to unseen behavior policies, and even outperforms supervised task-learning methods that assume the true task descriptions are known.

### Method Overview

1. **Bi-level task encoder**: The first level extracts a representation from each single-step transition tuple, and the second level aggregates these representations into a task representation.
2. **Contrastive learning**: The encoder is trained by optimizing InfoNCE, which is a lower bound of the mutual information. To generate negative samples, the authors propose two methods:
   - **Generative modeling**: Use a conditional variational auto-encoder (CVAE) to fit the joint data distribution across tasks.
   - **Reward randomization**: Generate negative samples by adding noise to rewards to increase diversity.
3. **Algorithm flow**:
   - Pretrain the generative model (if generative modeling is used).
   - Train the transition encoder, optimizing task representations through contrastive learning.
   - Train the policy, updating the policy and the Q-function with an offline reinforcement learning algorithm.

### Experimental Results

Experiments are carried out in multiple environments: Point-Robot, Ant-Dir, Half-Cheetah-Vel, Walker-Param, and Hopper-Param. The results show that CORRO adapts better and is more robust across different task distributions and offline datasets, especially when the behavior policy distribution changes.

### Formulas

- **Mutual information maximization**:

  \[ \max I(z; M)=\mathbb{E}_{z, M}\left[\log \frac{p(M|z)}{p(M)}\right] \]

- **Contrastive learning objective**:

  \[ \max_{\theta_1} \sum_{M_i \in M} \sum_{x, x' \in X_i}\left[\log \frac{\exp(S(z, z'))}{\sum_{M^* \in M} \exp(S(z, z^*))}\right] \]

  where \(z\) and \(z'\) are the latent codes of the samples \(x\) and \(x'\), \(z^*\) is the latent code of a sample associated with task \(M^*\), and \(S(z, z')\) is the similarity function between two latent codes, for which cosine similarity is usually used.

Through these methods and experiments, the authors demonstrate the effectiveness and robustness of CORRO in offline meta-reinforcement learning.
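The contrastive objective and the reward-randomization idea described above can be illustrated with a small sketch. This is not the authors' implementation: `toy_encoder` is a hypothetical stand-in that simply flattens a transition tuple `(s, a, r, s')` where a learned transition encoder would be used, the Gaussian noise and its scale are assumptions, and the loss is written for a single anchor rather than summed over all tasks and sample pairs.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity S(z, z') between two non-zero latent codes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(z, z_pos, z_negs):
    """Negative log of exp(S(z, z+)) over the sum of exp(S) for the positive and all negatives."""
    logits = [cosine_similarity(z, z_pos)]
    logits += [cosine_similarity(z, z_neg) for z_neg in z_negs]
    max_logit = max(logits)  # subtract the max before exponentiating for stability
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


def randomize_reward(transition, noise_scale, rng):
    """Assumed form of reward randomization: keep (s, a, s') and perturb only r."""
    state, action, reward, next_state = transition
    return (state, action, reward + rng.gauss(0.0, noise_scale), next_state)


def toy_encoder(transition):
    """Hypothetical encoder: concatenates the tuple into one flat vector."""
    state, action, reward, next_state = transition
    return list(state) + list(action) + [reward] + list(next_state)


rng = random.Random(0)
anchor = ((0.0, 1.0), (0.5,), 1.0, (0.1, 1.0))    # transition from task M_i
positive = ((0.2, 0.9), (0.4,), 1.0, (0.3, 0.9))  # another transition from M_i
negatives = [randomize_reward(anchor, 1.0, rng) for _ in range(4)]

loss = info_nce_loss(
    toy_encoder(anchor),
    toy_encoder(positive),
    [toy_encoder(n) for n in negatives],
)
print(f"InfoNCE loss for one anchor: {loss:.4f}")
```

Minimizing this loss corresponds to maximizing the log-ratio inside the contrastive objective for one anchor. Because each negative shares the anchor's state and action and differs only in reward, an encoder can only separate the two by attending to the reward signal, and not to which state-action pairs the behavior policy happened to visit.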

Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning
