GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, Giovanni Montana
2024-05-16
Abstract: Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for finetuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is how to make effective use of limited data and generalize to unseen goals in offline goal-conditioned reinforcement learning (GCRL). Existing offline GCRL methods are mainly model-free, which limits them when data is scarce and when goals fall outside the dataset. To overcome these challenges, the paper proposes a new framework named GOPlan, built on two key stages:

1. **Pre-training stage**:
   - Train a prior policy that can capture the multi-modal action distribution in multi-goal datasets.
   - Train this prior policy with an Advantage-Weighted Conditioned Generative Adversarial Network (CGAN), which avoids generating out-of-distribution (OOD) actions while favoring high-reward actions.
   - Learn a set of dynamics models for subsequent planning and uncertainty quantification.
2. **Reanalysis stage**:
   - Generate imaginary trajectories through planning and fine-tune the policy on them to further improve performance.
   - Use the reanalysis method to generate high-quality imaginary data, which enhances the agent's ability to reach goals both within and outside the dataset.
   - Iterate: planning produces better data, and the policy is fine-tuned on it with the Advantage-Weighted CGAN, improving policy performance while reducing the need for a large amount of offline data.

### Main contributions

1. **Propose GOPlan**: A new model-based offline GCRL algorithm that works effectively in settings with limited data and unseen goals.
2. **Two-stage framework**: The pre-training stage learns the prior policy through the Advantage-Weighted CGAN, and the reanalysis stage fine-tunes the policy on high-quality imaginary trajectories generated through planning.
3. **Experimental verification**: Extensive experimental evaluations on multiple multi-goal navigation and manipulation tasks demonstrate the effectiveness of GOPlan on benchmark tasks and in two challenging settings (small data budgets and generalization to unseen goals).

### Problems solved

- **Limited data**: Existing methods perform poorly when data is limited. GOPlan improves performance in this setting by using dynamics models and the reanalysis method to generate high-quality imaginary data.
- **Unseen goal generalization**: Existing methods struggle to generalize to unseen goals. GOPlan generalizes better through the combination of the Advantage-Weighted CGAN and dynamics models.

In summary, the GOPlan framework addresses two key problems in offline GCRL, handling limited data and generalizing to unseen goals, and offers an effective approach to offline multi-goal reinforcement learning.
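
Two of the ingredients named above, advantage weighting and uncertainty quantification with a set of dynamics models, can be illustrated with a minimal sketch. This is not the paper's implementation. The exponential form of the weights, the clipping value, the use of prediction variance as the disagreement measure, the threshold, and all function names (`advantage_weights`, `ensemble_disagreement`, `keep_imagined_transition`) are assumptions made for illustration, and the toy linear models stand in for learned dynamics networks.

```python
import numpy as np


def advantage_weights(advantages, temperature=1.0, max_weight=10.0):
    """Exponential advantage weights, clipped for stability.

    Assumed form: w = min(exp(A / temperature), max_weight).
    Higher-advantage samples get larger weight in the training loss.
    """
    adv = np.asarray(advantages, dtype=float)
    return np.minimum(np.exp(adv / temperature), max_weight)


def ensemble_disagreement(models, state, action):
    """Mean per-dimension variance of next-state predictions across models."""
    preds = np.stack([m(state, action) for m in models])  # (n_models, state_dim)
    return float(preds.var(axis=0).mean())


def keep_imagined_transition(models, state, action, threshold):
    """Accept an imagined transition only if the models roughly agree on it."""
    return ensemble_disagreement(models, state, action) <= threshold


def make_linear_model(gain):
    """Hypothetical toy dynamics model: next_state = state + gain * action."""
    return lambda s, a: s + gain * a


# A toy "ensemble" whose members differ slightly in their gain.
models = [make_linear_model(g) for g in (0.9, 1.0, 1.1)]
state = np.zeros(2)
small_action = np.array([0.1, 0.0])  # models nearly agree on the outcome
large_action = np.array([5.0, 0.0])  # models diverge on the outcome

if __name__ == "__main__":
    print(advantage_weights([-1.0, 0.0, 5.0]))
    print(keep_imagined_transition(models, state, small_action, threshold=0.01))
    print(keep_imagined_transition(models, state, large_action, threshold=0.01))
```

In this toy setup the small action yields low disagreement and is kept, while the large action is rejected. A filter of this kind is one plausible way to keep imagined data close to regions where the learned models are reliable.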