Abstract: Learning world models offers a promising avenue for goal-conditioned reinforcement learning with sparse rewards. By allowing agents to plan actions or exploratory goals without direct interaction with the environment, world models enhance exploration efficiency. The quality of a world model hinges on the richness of data stored in the agent's replay buffer, with expectations of reasonable generalization across the state space surrounding recorded trajectories. However, challenges arise in generalizing learned world models to state transitions backward along recorded trajectories or between states across different trajectories, hindering their ability to accurately model real-world dynamics. To address these challenges, we introduce a novel goal-directed exploration algorithm, MUN (short for "World Models for Unconstrained Goal Navigation"). This algorithm is capable of modeling state transitions between arbitrary subgoal states in the replay buffer, thereby facilitating the learning of policies to navigate between any "key" states. Experimental results demonstrate that MUN strengthens the reliability of world models and significantly improves the policy's capacity to generalize across new goal settings.
What problem does this paper attempt to address?
This paper addresses how to explore efficiently in sparse-reward environments in Goal-Conditioned Reinforcement Learning (GCRL). Specifically, the authors focus on using learned world models to improve an agent's exploration efficiency in long-horizon, sparse-reward environments, and on enabling the resulting policies to generalize to new goal settings.
### Problem Background
In GCRL, agents must perform tasks according to given goals, which are usually specified by users. Because designing a dense reward function for each task is both time-consuming and error-prone, rewards in practice are often sparse: the agent receives a reward signal only when it reaches the goal state. This sparsity makes exploration during training hard, because an agent taking random actions is unlikely to find a path that ever produces a reward.
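To make the sparse-reward setting concrete, here is a minimal sketch of a goal-conditioned sparse reward. The function name `sparse_goal_reward`, the Euclidean distance test, and the tolerance `tol` are illustrative assumptions, not details taken from the paper.

```python
import math


def sparse_goal_reward(state, goal, tol=0.05):
    """Hypothetical sparse reward: 1.0 only when the state is within
    `tol` (Euclidean distance) of the goal, otherwise 0.0."""
    return 1.0 if math.dist(state, goal) <= tol else 0.0
```

Under a reward like this, every state outside the small tolerance ball around the goal yields zero, so random exploration gives the agent almost no learning signal.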
### Limitations of Existing Methods
To address this challenge, existing methods such as Hafner et al. (2019a) and Mendonca et al. (2021) use Model-Based Reinforcement Learning (MBRL), that is, they learn a world model that predicts the environment dynamics. However, these methods have the following limitations:
1. **Insufficient Data Coverage**: The quality of the world model depends on the richness of the data in the replay buffer, but existing methods struggle to capture state transitions across different trajectories, which leads to inaccurate modeling of the real-world dynamics.
2. **Poor Generalization Ability**: Existing methods perform poorly on transitions that run backward along recorded trajectories or that cross between trajectories, which limits their adaptability to new environments or new tasks.
### Proposed Method: MUN
To solve these problems, the paper introduces a new goal-directed exploration algorithm, MUN (World Models for Unconstrained Goal Navigation). The main contributions of MUN include:
1. **Bidirectional Replay Buffer**: MUN adopts a bidirectional replay buffer, which covers a wider observation space and captures richer dynamic transitions. This improves the generalization ability and reliability of the world model.
2. **Key Sub-goal Generation**: MUN proposes a method named DAD (Distinct Action Discovery) to identify key sub-goals, which are the milestones for completing complex tasks. By training the world model to simulate unconstrained transitions between these key sub-goals, MUN obtains a model that captures the task structure more accurately and learns a policy that can adapt to new goal scenarios.
3. **Efficient Exploration Strategy**: Compared with methods such as Go-Explore, MUN reduces the need to train an additional exploration policy by replacing the "exploration phase" with another "advancement phase", which improves computational efficiency.
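The sub-goal ideas above can be sketched in code. This summary does not say how DAD selects key sub-goals, so the greedy farthest-point selection over actions below is only one plausible reading, and the names `pick_distinct_actions` and `subgoal_pairs` are hypothetical. The second function illustrates what "unconstrained" navigation means here: every ordered pair of sub-goals is a candidate task, including the reverse direction.

```python
import itertools
import math


def pick_distinct_actions(actions, k):
    """Assumed stand-in for DAD: greedily choose indices of up to k actions
    that are far apart from each other (farthest-point selection)."""
    if not actions or k <= 0:
        return []
    chosen = [0]  # start from the first recorded action
    # Distance from every action to its nearest already-chosen action.
    min_dist = [math.dist(a, actions[0]) for a in actions]
    while len(chosen) < min(k, len(actions)):
        nxt = max(range(len(actions)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(a, actions[nxt]))
                    for d, a in zip(min_dist, actions)]
    return chosen


def subgoal_pairs(subgoals):
    """All ordered (start, goal) pairs, so both A -> B and B -> A appear."""
    return list(itertools.permutations(subgoals, 2))
```

In such a sketch, the states at which the selected actions were taken would serve as key sub-goals, and each pair from `subgoal_pairs` would be a navigation task used to collect data for the replay buffer.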
### Experimental Results
Experiments show that MUN performs well in multiple challenging robot manipulation and navigation environments. In particular, in tasks such as block stacking, block rotation, and pen rotation, its success rate exceeds 95%, far above the baseline methods. In addition, MUN can use the bidirectional replay buffer to train a more general policy that better handles navigation between arbitrary sub-goals.
In conclusion, MUN significantly improves the exploration efficiency and policy generalization of agents in sparse-reward environments by improving world-model learning and the identification of key sub-goals.