Recursive Backwards Q-Learning in Deterministic Environments

Jan Diekhoff, Jörn Fischer
2024-04-24
Abstract: Reinforcement learning is a popular method of finding optimal solutions to complex problems. Algorithms like Q-learning excel at learning to solve stochastic problems without a model of their environment. However, they take longer to solve deterministic problems than is necessary. Q-learning can be improved to better solve deterministic problems by introducing such a model-based approach. This paper introduces the recursive backwards Q-learning (RBQL) agent, which explores and builds a model of the environment. After reaching a terminal state, it recursively propagates its value backwards through this model. This lets each state be evaluated to its optimal value without a lengthy learning process. In the example of finding the shortest path through a maze, this agent greatly outperforms a regular Q-learning agent.
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses reinforcement learning in deterministic environments, focusing on the shortcomings of Q-learning for such problems. Traditional Q-learning is well suited to stochastic problems, but in deterministic environments it converges more slowly than necessary because it does not use a model of the environment. The paper proposes a new algorithm, Recursive Backwards Q-Learning (RBQL), which aims to find the optimal policy more quickly by building a model of the environment and propagating values backwards through it once a terminal state is reached.

The RBQL algorithm works as follows:

1. **Exploration and Modeling**: The RBQL agent builds a model of the environment while it explores.
2. **Backwards Value Propagation**: When a terminal state is reached, the algorithm traverses the explored states backwards and updates the value of each state according to the recursive backwards Q-learning update rule.
3. **Improved Learning Rule**: Setting the learning rate to 1 simplifies the Q-learning update formula, so that the value of each state depends directly on its reward and the discounted value of its best neighbor.

The paper also describes implementation details of RBQL, including the use of the Godot game engine for the simulation experiments and the handling of the balance between exploration and exploitation.

In the experiments, the paper compares RBQL with standard Q-learning on maze tasks of different sizes. RBQL outperforms Q-learning in all test cases, and its advantage becomes more apparent as the maze size increases: in larger mazes RBQL needs fewer steps on average to find the shortest path and performs more stably, whereas Q-learning requires more exploratory steps.
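The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation (the summary mentions Godot but gives no code), and it rests on several assumptions: the names `RBQLModel`, `record`, `state_value`, `propagate_backwards`, and `GAMMA` are hypothetical, the reward is assumed to be 1 on reaching the goal and 0 otherwise, and a breadth-first traversal from the terminal state is assumed as the order of the backwards updates. With a learning rate of 1, the standard Q-learning update reduces to `Q(s, a) = r + GAMMA * max_a' Q(s', a')`, which is the rule applied here.

```python
from collections import defaultdict, deque

GAMMA = 0.9  # discount factor (assumed value, not taken from the paper)


class RBQLModel:
    """Hypothetical container for the explored model and the Q-table."""

    def __init__(self):
        self.transitions = {}                 # (state, action) -> (next_state, reward)
        self.actions = defaultdict(set)       # state -> actions tried from that state
        self.predecessors = defaultdict(set)  # next_state -> {(state, action)}
        self.q = defaultdict(float)           # (state, action) -> value, default 0.0

    def record(self, state, action, next_state, reward):
        """Step 1: store one observed deterministic transition in the model."""
        self.transitions[(state, action)] = (next_state, reward)
        self.actions[state].add(action)
        self.predecessors[next_state].add((state, action))

    def state_value(self, state):
        """Value of the best known action in a state (0.0 if none was tried)."""
        return max((self.q[(state, a)] for a in self.actions[state]), default=0.0)

    def propagate_backwards(self, terminal_state):
        """Steps 2 and 3: walk the model backwards from the terminal state,
        applying the learning-rate-1 update to every incoming transition."""
        visited = {terminal_state}
        queue = deque([terminal_state])
        while queue:
            state = queue.popleft()
            for prev_state, action in self.predecessors[state]:
                _, reward = self.transitions[(prev_state, action)]
                self.q[(prev_state, action)] = reward + GAMMA * self.state_value(state)
                if prev_state not in visited:
                    visited.add(prev_state)
                    queue.append(prev_state)


# Usage: a four-cell corridor 0 - 1 - 2 - 3 with the goal in cell 3.
model = RBQLModel()
model.record(0, "right", 1, 0.0)
model.record(1, "left", 0, 0.0)
model.record(1, "right", 2, 0.0)
model.record(2, "left", 1, 0.0)
model.record(2, "right", 3, 1.0)  # reaching the goal yields the assumed reward of 1
model.propagate_backwards(3)
```

After a single backwards pass, every `right` action in this corridor holds a higher value than the `left` action in the same cell, so a greedy policy already follows the shortest path. Plain Q-learning would typically need many episodes for the goal reward to reach the start cell.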