Stitching Sub-Trajectories with Conditional Diffusion Model for Goal-Conditioned Offline RL

Sungyoon Kim, Yunseon Choi, Daiki E. Matsunaga, Kee-Eung Kim
2024-02-12
Abstract: Offline Goal-Conditioned Reinforcement Learning (Offline GCRL) is an important problem in RL that focuses on acquiring diverse goal-oriented skills solely from pre-collected behavior datasets. In this setting, the reward feedback is typically absent except when the goal is achieved, which makes it difficult to learn policies especially from a finite dataset of suboptimal behaviors. In addition, realistic scenarios involve long-horizon planning, which necessitates the extraction of useful skills within sub-trajectories. Recently, the conditional diffusion model has been shown to be a promising approach to generate high-quality long-horizon plans for RL. However, their practicality for the goal-conditioned setting is still limited due to a number of technical assumptions made by the methods. In this paper, we propose SSD (Sub-trajectory Stitching with Diffusion), a model-based offline GCRL method that leverages the conditional diffusion model to address these limitations. In summary, we use the diffusion model that generates future plans conditioned on the target goal and value, with the target value estimated from the goal-relabeled offline dataset. We report state-of-the-art performance in the standard benchmark set of GCRL tasks, and demonstrate the capability to successfully stitch the segments of suboptimal trajectories in the offline data to generate high-quality plans.
Artificial Intelligence
### What problem does this paper attempt to address?

This paper targets two main challenges in offline goal-conditioned reinforcement learning (Offline GCRL):

1. **Reward Sparsity**: In offline GCRL the reward signal is sparse; a reward is provided only when the goal is achieved. This makes it difficult to learn a policy from a limited dataset of suboptimal behaviors.
2. **Unrealistic Trajectory Generation in Long-Horizon Planning Tasks**: Although existing diffusion-model-based methods show potential for generating high-quality long-horizon plans, they tend to generate unrealistic trajectories, especially in complex environments.

To address these challenges, the authors propose SSD (Sub-trajectory Stitching with Diffusion), which uses a conditional diffusion model to generate sub-trajectories and stitches them together to reach the goal. Specifically, SSD relies on two components:

- **Multi-step Goal Chaining**: Improves the extraction of useful skills from suboptimal trajectory fragments. Chaining goals over multiple steps captures and integrates the valuable parts of those fragments more effectively, which yields higher-quality behaviors.
- **Value-Conditional Diffusion Model**: A conditional diffusion model that generates future plans based on the goal and the estimated action values. This design avoids the need for an optimal plan length, explicit sub-goals, or hierarchical architectures, improving the realism and practicality of the generated trajectories.

### Core Contributions of the Paper

1. **Innovative Architecture**: Proposes the Condition-Prompted-Unet architecture, which combines the Unet structure with Transformer blocks to better capture complex patterns and preserve spatial information, thereby generating more realistic trajectories.
2. **Efficient Training Method**: Addresses reward sparsity and unrealistic trajectory generation by alternately training the goal-conditioned value function and the value-conditional diffusion model.
3. **Excellent Experimental Results**: Demonstrates state-of-the-art performance on standard benchmark datasets such as Maze2D and Multi2D, and also achieves significant results in robotic manipulation tasks (such as the Fetch environment).

### Summary

By combining the conditional diffusion model with multi-step goal chaining, SSD addresses reward sparsity and unrealistic trajectory generation in long-horizon offline GCRL planning tasks, providing a new approach for offline reinforcement learning.
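To make the sparse-reward setting and the goal-relabeled value target more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the "pick a random future state as the goal" relabeling rule, the distance threshold `goal_tol`, and the n-step bootstrapped target are all assumptions chosen for illustration, since the summary above does not specify SSD's exact relabeling or chaining procedure.

```python
import numpy as np


def relabel_with_future_goals(states, rng, goal_tol=0.5):
    """Hindsight-style relabeling (illustrative assumption, not SSD's exact rule).

    For each step t, a state visited later in the same trajectory is used as
    the goal. The reward is sparse: 1.0 only if the next state lies within
    goal_tol of that goal, otherwise 0.0.
    """
    num_states = len(states)
    transitions = []
    for t in range(num_states - 1):
        k = rng.integers(t + 1, num_states)  # future index in [t + 1, num_states)
        goal = states[k]
        reached = np.linalg.norm(states[t + 1] - goal) <= goal_tol
        transitions.append((states[t], states[t + 1], goal, float(reached)))
    return transitions


def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step value target: sum_i gamma^i * r_i + gamma^n * bootstrap_value.

    A generic multi-step target, used here as a stand-in for propagating
    sparse goal rewards across several steps of a sub-trajectory.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * bootstrap_value)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trajectory = np.arange(5, dtype=float).reshape(5, 1)  # toy 1-D states 0..4
    for state, next_state, goal, reward in relabel_with_future_goals(trajectory, rng):
        print(state, next_state, goal, reward)
    print(n_step_target([0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=0.5))
```

In a value-conditioned planner of the kind described above, targets like these would be used to train a goal-conditioned value function, whose estimates then serve as the conditioning signal for the diffusion model alongside the goal.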