Abstract: We study offline meta-reinforcement learning, a practical reinforcement learning paradigm that learns from offline data to adapt to new tasks. The distribution of offline data is determined jointly by the behavior policy and the task. Existing offline meta-reinforcement learning algorithms cannot distinguish these factors, making task representations unstable to the change of behavior policies. To address this problem, we propose a contrastive learning framework for task representations that are robust to the distribution mismatch of behavior policies in training and test. We design a bi-level encoder structure, use mutual information maximization to formalize task representation learning, derive a contrastive learning objective, and introduce several approaches to approximate the true distribution of negative pairs. Experiments on a variety of offline meta-reinforcement learning benchmarks demonstrate the advantages of our method over prior methods, especially on the generalization to out-of-distribution behavior policies. The code is available at https://github.com/PKU-AI-Edge/CORRO.
What problem does this paper attempt to address?
The paper addresses a problem in offline meta-reinforcement learning (OMRL): how to learn, from offline datasets, task representations that are robust to changes in the behavior policy distribution. In existing methods, when the behavior policy distributions differ between training and testing, the task representations are easily disturbed and generalization suffers. To solve this, the authors propose a contrastive learning framework, CORRO (COntrastive Robust task Representation learning for OMRL), which aims to improve the robustness and generalization of task representations by maximizing the mutual information between task representations and tasks while minimizing the influence of behavior policies.
### Main Contributions
1. **A new framework**: The framework learns robust task representations from fully offline datasets. These representations can distinguish different tasks even though the transition distributions are determined jointly by behavior policies and tasks.
2. **A contrastive learning objective**: Maximizing the mutual information between task representations and tasks extracts the features shared by transitions of the same task and captures the essential differences in reward functions and transition dynamics between tasks.
3. **Experimental verification**: On multiple benchmarks, the method outperforms previous methods in generalizing to unseen behavior policies, and even outperforms supervised task learning methods that assume ground-truth task descriptions are known.
### Method Overview
1. **Bi-level task encoder**: The first level extracts a representation from each single-step transition tuple, and the second level aggregates these representations.
2. **Contrastive learning**: The encoder is trained by optimizing InfoNCE, which is a lower bound of mutual information. To generate negative samples, the authors propose two methods:
- **Generative modeling**: Use a conditional variational autoencoder (CVAE) to fit the joint data distribution across tasks.
- **Reward randomization**: Generate negative samples by adding noise to rewards to increase diversity.
3. **Algorithm flow**:
- Pretrain the generative model (if using generative modeling).
- Train the transition encoder, optimizing task representations through contrastive learning.
- Train the policy: update the policy and Q-function using an offline reinforcement learning algorithm.
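The encoder structure and the reward-randomization step above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the transition layout `[s, a, r, s']`, the one-hidden-layer MLP, the mean-pooling aggregator, and all function names, sizes, and the `noise_scale` parameter are hypothetical choices made for the example.

```python
import numpy as np


def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a one-hidden-layer MLP (sizes are hypothetical)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def transition_encoder(params, x):
    """First level: map each transition row [s, a, r, s'] to a latent code."""
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]


def aggregate(z):
    """Second level: pool per-transition codes into one task representation.

    Mean pooling is an assumption used here as a simple stand-in aggregator.
    """
    return z.mean(axis=0)


def reward_randomized_negatives(x, state_dim, action_dim,
                                num_negatives, noise_scale, rng):
    """Build negatives for one transition by perturbing only its reward.

    Assumes the layout [s, a, r, s'], so the reward sits at index
    state_dim + action_dim. All other entries are copied unchanged.
    """
    reward_idx = state_dim + action_dim
    negatives = np.repeat(x[None, :], num_negatives, axis=0)
    negatives[:, reward_idx] += rng.normal(0.0, noise_scale, num_negatives)
    return negatives


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state_dim, action_dim = 3, 1
    x_dim = 2 * state_dim + action_dim + 1          # [s, a, r, s']
    context = rng.normal(size=(5, x_dim))           # 5 transitions of one task
    params = init_mlp(x_dim, 16, 4, rng)
    codes = transition_encoder(params, context)     # shape (5, 4)
    task_repr = aggregate(codes)                    # shape (4,)
    negatives = reward_randomized_negatives(
        context[0], state_dim, action_dim, 6, 0.5, rng)
    print(codes.shape, task_repr.shape, negatives.shape)
```

The negatives share the state and action of the anchor transition, so the encoder cannot tell them apart by looking at which state-action pairs the behavior policy visited; only the task-dependent part (here, the reward) differs.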
### Experimental Results
Experiments are carried out in multiple environments: Point-Robot, Ant-Dir, Half-Cheetah-Vel, Walker-Param, and Hopper-Param. The results show that CORRO adapts better and more robustly than prior methods across different task distributions and offline datasets, especially when the behavior policy distribution changes between training and testing.
### Formulas
- **Mutual information maximization**:
\[
\max I(z; M)=\mathbb{E}_{z, M}\left[\log \frac{p(M|z)}{p(M)}\right]
\]
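- **Contrastive learning objective**:
\[
\max_{\theta_1} \sum_{M_i \in \mathcal{M}} \sum_{x, x' \in X_i}\left[\log \frac{\exp(S(z, z'))}{\sum_{M^* \in \mathcal{M}} \exp(S(z, z^*))}\right]
\]
where \(\mathcal{M}\) is the set of training tasks, \(z\) and \(z'\) are the latent codes of two transitions \(x, x'\) from the same task \(M_i\), \(z^*\) is the latent code of a sample associated with task \(M^*\), and \(S(\cdot, \cdot)\) is a similarity function between latent codes, typically the cosine similarity.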
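For a single anchor transition, the bracketed term of the contrastive objective is a log-softmax over similarity scores. The sketch below computes its negative (a loss to minimize) with cosine similarity in NumPy. It is an illustrative assumption rather than the authors' code: the function names are hypothetical, no temperature is used, and the positive pair is included in the denominator alongside the negatives.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity along the last axis, with a small epsilon for safety."""
    a_unit = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b_unit = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return np.sum(a_unit * b_unit, axis=-1)


def info_nce_loss(z, z_pos, z_neg):
    """Negative log-softmax score of the positive pair.

    z:     (d,)   latent code of the anchor transition
    z_pos: (d,)   latent code of another transition from the same task
    z_neg: (k, d) latent codes of k negative samples
    """
    pos_score = cosine_similarity(z, z_pos)             # scalar
    neg_scores = cosine_similarity(z[None, :], z_neg)   # shape (k,)
    logits = np.concatenate([[pos_score], neg_scores])
    logits = logits - logits.max()                      # numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos


if __name__ == "__main__":
    anchor = np.array([1.0, 0.0])
    # Positive aligned with the anchor, negatives pointing elsewhere.
    aligned = info_nce_loss(anchor, np.array([1.0, 0.0]),
                            np.array([[0.0, 1.0], [-1.0, 0.0]]))
    # Positive opposite to the anchor, one negative aligned with it.
    misaligned = info_nce_loss(anchor, np.array([-1.0, 0.0]),
                               np.array([[1.0, 0.0], [0.0, 1.0]]))
    print(aligned < misaligned)
```

Minimizing this loss pulls codes of same-task transitions together and pushes them away from the negatives, which is how the objective raises the lower bound on \(I(z; M)\).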
Through these methods and experiments, the authors demonstrate the effectiveness and robustness of CORRO in offline meta-reinforcement learning.