Abstract: Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for fine-tuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.
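To make the "advantage-weighted" idea in the abstract concrete, here is a minimal sketch of how per-sample advantages could be turned into training weights. The abstract does not give the weighting formula, so the exponential form, the `temperature` and `max_weight` parameters, and both function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def advantage_weights(advantages, temperature=1.0, max_weight=10.0):
    """Map per-sample advantages to non-negative training weights.

    Assumption: weights are exp(advantage / temperature), clipped at
    max_weight for numerical stability. The source only says
    "advantage-weighted" and does not specify this form.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [min(math.exp(a / temperature), max_weight) for a in advantages]


def weighted_mean_loss(per_sample_losses, weights):
    """Average per-sample losses so that high-advantage samples count more."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total
```

Under this assumption, actions with higher estimated advantage contribute more to the generator and discriminator losses, which pushes the policy toward high-reward modes that are still present in the dataset.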

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to effectively utilize limited data and generalize to unseen goals in offline goal-conditioned reinforcement learning (GCRL). Specifically, existing offline GCRL methods are mainly based on model-free methods, which have limitations in dealing with limited data and generalizing to unseen goals. To overcome these challenges, this paper proposes a new framework named GOPlan, which is achieved through the following two key stages:

1. **Pre-training stage**:
   - Train a prior policy that can capture the multimodal action distribution in multi-goal datasets.
   - Use the Advantage-Weighted Conditioned Generative Adversarial Network (CGAN) to train the prior policy to avoid generating out-of-distribution (OOD) actions and optimize high-reward actions.
   - Learn a set of dynamics models for subsequent planning and uncertainty quantification.
2. **Re-analysis stage**:
   - Generate imaginary trajectories through planning and fine-tune the policy to further optimize performance.
   - Use the re-analysis method to generate high-quality imaginary data, which can enhance the agent's ability to reach goals within and outside the dataset.
   - Generate better data through iterative planning and fine-tune the policy using Advantage-Weighted CGAN, significantly improving policy performance while reducing the need for a large amount of offline data.

### Main contributions

1. **Propose GOPlan**: A new model-based offline GCRL algorithm that can work effectively in settings with limited data and unseen goals.
2. **Two-stage framework**: The pre-training stage learns the prior policy through Advantage-Weighted CGAN, and the re-analysis stage fine-tunes the policy by generating high-quality imaginary trajectories through planning.
3. **Experimental verification**: Conducted extensive experimental evaluations on multiple multi-goal navigation and manipulation tasks, demonstrating the effectiveness of GOPlan in benchmark tests and two challenging settings (small data budget and unseen goal generalization).

### Problems solved

- **Limited data**: Existing methods do not work well when dealing with limited data. GOPlan improves performance under limited data by using dynamics models and the re-analysis method to generate high-quality imaginary data.
- **Unseen goal generalization**: Existing methods are difficult to generalize to unseen goals. GOPlan can better generalize to unseen goals through the combination of Advantage-Weighted CGAN and dynamics models.

In summary, by proposing the GOPlan framework, this paper solves the key problems of dealing with limited data and generalizing to unseen goals in offline GCRL, providing an effective solution for offline multi-goal reinforcement learning.
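The summary above says a set of dynamics models is learned for planning and uncertainty quantification. The sketch below shows one way that could work: ensemble disagreement serves as the uncertainty signal, and candidate action sequences are scored by how close the imagined final state lands to the goal. The disagreement measure, the scoring rule, the `penalty` parameter, and all function names are illustrative assumptions; the summary does not specify GOPlan's actual planner.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = List[float]
Action = List[float]
Model = Callable[[State, Action], State]


def ensemble_predict(
    models: Sequence[Model], state: State, action: Action
) -> Tuple[State, float]:
    """Return the mean next state and a disagreement score for one step."""
    preds = [m(state, action) for m in models]
    n, dim = len(preds), len(state)
    mean = [sum(p[i] for p in preds) / n for i in range(dim)]
    # Assumed uncertainty measure: largest per-dimension variance
    # across ensemble members.
    disagreement = max(
        sum((p[i] - mean[i]) ** 2 for p in preds) / n for i in range(dim)
    )
    return mean, disagreement


def plan_first_action(models, state, goal, candidates, penalty=1.0):
    """Roll out each candidate action sequence and pick the best one.

    Assumed score: negative distance from the imagined final state to the
    goal, minus a penalty on accumulated ensemble disagreement.
    Returns (first action of the best sequence, its score).
    """
    if not candidates:
        raise ValueError("need at least one candidate action sequence")
    best_score, best_seq = -math.inf, None
    for seq in candidates:
        s, uncertainty = list(state), 0.0
        for a in seq:
            s, d = ensemble_predict(models, s, a)
            uncertainty += d
        score = -math.dist(s, goal) - penalty * uncertainty
        if score > best_score:
            best_score, best_seq = score, seq
    return best_seq[0], best_score
```

In a re-analysis loop of the kind described, the candidate sequences would be sampled from the prior policy, and the imagined trajectories that score well would be added to the data used for fine-tuning. Penalizing disagreement keeps the planner from trusting rollouts in regions where the learned models are unreliable.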

GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

Beyond Reward: Offline Preference-guided Policy Optimization

Hierarchical Planning Through Goal-Conditioned Offline Reinforcement Learning

Goal-conditioned Offline Planning from Curious Exploration

Provably Efficient Offline Goal-Conditioned Reinforcement Learning with General Function Approximation and Single-Policy Concentrability

Model-Based Offline Planning

Goal-conditioned offline reinforcement learning through state space partitioning

Goal-Conditioned Reinforcement Learning with Disentanglement-based Reachability Planning

Goal-Conditioned Predictive Coding for Offline Reinforcement Learning

Curriculum Goal-Conditioned Imitation for Offline Reinforcement Learning

Planning, Fast and Slow: Online Reinforcement Learning with Action-Free Offline Data Via Multiscale Planners

Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

Offline Goal-Conditioned Reinforcement Learning for Safety-Critical Tasks with Recovery Policy

GenPlan: Generative sequence models as adaptive planners

SMORE: Score Models for Offline Goal-Conditioned Reinforcement Learning

What is Essential for Unseen Goal Generalization of Offline Goal-conditioned RL?

Offline Policy Learning via Skill-step Abstraction for Long-horizon Goal-Conditioned Tasks

Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling

HIQL: Offline Goal-Conditioned RL with Latent States as Actions

Goal Agnostic Learning and Planning without Reward Functions