Offline Meta Reinforcement Learning with In-Distribution Online Adaptation

Jianhao Wang, Jin Zhang, Haozhe Jiang, Junyu Zhang, Liwei Wang, Chongjie Zhang
2023-06-02
Abstract: Recent offline meta-reinforcement learning (meta-RL) methods typically utilize task-dependent behavior policies (e.g., training RL agents on each individual task) to collect a multi-task dataset. However, these methods always require extra information for fast adaptation, such as offline context for testing tasks. To address this problem, we first formally characterize a unique challenge in offline meta-RL: transition-reward distribution shift between offline datasets and online adaptation. Our theory finds that out-of-distribution adaptation episodes may lead to unreliable policy evaluation and that online adaptation with in-distribution episodes can ensure adaptation performance guarantee. Based on these theoretical insights, we propose a novel adaptation framework, called In-Distribution online Adaptation with uncertainty Quantification (IDAQ), which generates in-distribution context using a given uncertainty quantification and performs effective task belief inference to address new tasks. We find a return-based uncertainty quantification for IDAQ that performs effectively. Experiments show that IDAQ achieves state-of-the-art performance on the Meta-World ML1 benchmark compared to baselines with/without offline adaptation.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### What problem does this paper attempt to solve?

This paper focuses on a key challenge in offline meta-reinforcement learning (offline meta-RL): **transition-reward distribution shift between offline datasets and online adaptation**. Specifically, existing methods require additional information or assumptions, such as offline context for the test tasks, to adapt quickly to new tasks. However, this additional information may be unavailable or difficult to obtain in practical applications.

#### Research Background

Offline meta-reinforcement learning aims to train a meta-policy from pre-collected multi-task offline datasets, without interacting with the environment during training, so that the policy can quickly adapt to new tasks. However, existing offline meta-RL methods have the following problems:

1. **Dependence on additional information**: Methods such as FOCAL and MACAW require offline context for rapid adaptation to test tasks.
2. **Distribution shift problem**: The transition-reward distribution in the offline dataset may differ from the one encountered during online adaptation, resulting in unreliable policy evaluation.

#### Main Contributions

To solve the above problems, the authors propose a new framework, **In-Distribution online Adaptation with uncertainty Quantification (IDAQ)**, whose main goals are:

- **Identify in-distribution adaptation episodes** so that the context used during online adaptation is consistent with the offline dataset.
- **Utilize uncertainty quantification techniques** to generate in-distribution online adaptation samples, thereby ensuring the reliability of adaptation performance.

#### Specific Methods

IDAQ achieves its goals through the following steps:

1. **Define transition-reward distribution shift**: Formalize this phenomenon from the perspective of Bayesian reinforcement learning (Bayesian RL) and prove that it exists.
2. **Propose theoretical insights**: Prove that in-distribution online adaptation provides consistent performance guarantees, and that a meta-policy using Thompson sampling can generate in-distribution online adaptation samples.
3. **Design the IDAQ framework**:
   - **Reference phase**: Estimate the uncertainty threshold δ used to decide which samples are in-distribution.
   - **Iterative update phase**: Collect online adaptation samples according to the current task belief and meta-policy, then update the in-distribution context and the task belief according to the uncertainty quantification results.
4. **Uncertainty quantification**: Introduce three uncertainty quantification methods (prediction error, prediction variance, and return-based quantification), and verify their effectiveness through experiments.

#### Experimental Results

The authors conducted large-scale experiments on the Meta-World ML1 benchmark. The results show that IDAQ significantly outperforms baseline methods on multiple task sets and performs particularly well on medium or expert datasets.

### Summary

This paper formalizes the transition-reward distribution shift problem in offline meta-reinforcement learning and proposes the IDAQ framework, which removes the reliance of existing methods on additional information such as offline context. Its effectiveness is supported by both theoretical analysis and experiments.
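The reference phase and iterative update phase listed under Specific Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name in it (`return_based_uncertainty`, `estimate_threshold`, `adapt_in_distribution`, `keep_fraction`) is hypothetical, and two details are assumptions made only for illustration: the return-based score is taken to be the negative episode return, and δ is taken to be a quantile of the scores of the reference episodes.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical data types for this sketch: an episode is a list of
# (state, action, reward) transitions, and a belief is any object the
# caller chooses (None stands for the prior belief).
Transition = Tuple[float, float, float]
Episode = List[Transition]


def return_based_uncertainty(episode: Episode) -> float:
    """Assumed score: the negative episode return, so low-return episodes
    count as more uncertain (more likely out-of-distribution)."""
    return -sum(reward for _, _, reward in episode)


def estimate_threshold(reference_episodes: List[Episode],
                       keep_fraction: float = 0.5) -> float:
    """Reference phase (assumed rule): choose delta so that roughly
    keep_fraction of the reference episodes score at or below it."""
    scores = sorted(return_based_uncertainty(ep) for ep in reference_episodes)
    index = max(0, int(keep_fraction * len(scores)) - 1)
    return scores[index]


def adapt_in_distribution(collect_episode: Callable[[Any], Episode],
                          sample_task: Callable[[Any], Any],
                          update_belief: Callable[[List[Episode]], Any],
                          delta: float,
                          num_iterations: int) -> Tuple[List[Episode], Any]:
    """Iterative update phase: sample a task hypothesis from the current
    belief (Thompson-sampling style), roll out one episode, and keep it as
    context only if its uncertainty score is at most delta."""
    context: List[Episode] = []
    belief = None  # start from the prior
    for _ in range(num_iterations):
        task = sample_task(belief)
        episode = collect_episode(task)
        if return_based_uncertainty(episode) <= delta:
            context.append(episode)           # in-distribution: keep it
            belief = update_belief(context)   # refine the task belief
    return context, belief


if __name__ == "__main__":
    # Toy data: four single-step reference episodes with returns 1 to 4.
    reference = [[(0.0, 0.0, r)] for r in (1.0, 2.0, 3.0, 4.0)]
    delta = estimate_threshold(reference)

    # Toy environment: the reward equals the guessed task value.
    guesses = iter([0.0, 2.0, 4.0])
    context, belief = adapt_in_distribution(
        collect_episode=lambda task: [(0.0, task, task)],
        sample_task=lambda belief: next(guesses),
        update_belief=lambda ctx: len(ctx),
        delta=delta,
        num_iterations=3,
    )
    print(delta, len(context), belief)
```

In this toy run only the highest-return rollout passes the threshold, so the belief is updated from a single episode. In a real setting the three callbacks would wrap the meta-policy, the environment, and the context encoder, and a different score (prediction error or prediction variance) could replace `return_based_uncertainty` without changing the loop.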