Abstract: Navigation is a fundamental cognitive skill extensively studied in neuroscientific experiments and has lately gained substantial interest in artificial intelligence research. Recreating the task solved by rodents in the well-established Morris Water Maze (MWM) experiment, this work applies a transformer-based architecture using deep reinforcement learning -- an approach previously unexplored in this context -- to navigate a 2D version of the maze. Specifically, the agent leverages a decoder-only transformer architecture serving as a deep Q-network performing effective decision making in the partially observable environment. We demonstrate that the proposed architecture enables the agent to efficiently learn spatial navigation strategies, overcoming challenges associated with a limited field of vision, corresponding to the visual information available to a rodent in the MWM. Demonstrating the potential of transformer-based models for enhancing navigation performance in partially observable environments, this work suggests promising avenues for future research in artificial agents whose behavior resembles that of biological agents. Finally, the flexibility of the transformer architecture in supporting varying input sequence lengths opens opportunities for gaining increased understanding of the artificial agent's inner representation of the environment.
What problem does this paper attempt to address?
The paper asks how Transformer-based deep reinforcement learning can be used for spatial navigation in a partially observable environment. Specifically, the authors recreate the task from the classic Morris Water Maze (MWM) experiment as a 2D simulation and train an agent built on a Transformer architecture to efficiently find the hidden platform in this environment.
### Problem Background
The Morris Water Maze experiment is a behavioral test widely used to evaluate the spatial learning ability of rodents. In this experiment, rodents are placed in a circular pool filled with opaque water and need to find a hidden platform through visual cues. As the number of trials increases, rodents gradually learn to use distant visual cues to locate the platform.
In artificial intelligence, researchers have shown great interest in solving similar navigation tasks, especially when simulating the behavior of organisms. However, most existing studies assume a fully observable environment, which does not match the actual experiment: the field of view of rodents in the MWM is limited. To simulate the experiment more realistically, the agent therefore has to cope with partial observability.
### Main Contributions of the Paper
1. **Application of Transformer Architecture**: This study is the first to apply the Transformer architecture to the partially observable Morris Water Maze task. Specifically, a decoder-only Transformer architecture is used as a Deep Q-Network (DQN) for effective decision-making.
2. **Simulation of Circular MWM Environment**: Unlike previous work that used a square MWM, this study simulates a circular MWM environment that is closer to the actual experiment, making the experimental results more realistic.
3. **Reflection of Rodents' Visual Experience**: The field of view of the agent is designed to be a limited range similar to that of rodents, ensuring the authenticity of the experimental conditions. In addition, the agent can efficiently learn navigation strategies without auxiliary tasks.
### Method Overview
- **Environment Modeling**: The MWM environment is modeled as a Partially Observable Markov Decision Process (POMDP), so the agent receives only partial observations of the environment.
- **Agent Architecture**: The agent uses a two-layer, decoder-only Transformer with 8 attention heads; its input is an embedding of the sequence of past observations.
- **Training Process**: The agent is trained with deep Q-learning, updating Q-values by minimizing the Bellman error. An epsilon-greedy strategy balances exploration and exploitation during training, and negative rewards are introduced to encourage goal-oriented behavior.
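The training step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the discount factor `gamma`, the squared-error form of the loss, and the example reward values (a small step penalty and a goal reward) are all assumptions, and the Q-values are plain arrays standing in for the Transformer's output.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def td_targets(rewards, next_q, dones, gamma=0.99):
    """One-step Bellman targets r + gamma * max_a' Q(s', a'); no bootstrap at terminal steps."""
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

def bellman_loss(q_taken, targets):
    """Mean squared Bellman error between predicted Q(s, a) and the targets."""
    return float(np.mean((q_taken - targets) ** 2))

# Toy batch of two transitions (all numbers are made up for illustration).
rng = np.random.default_rng(0)
rewards = np.array([-0.01, 1.0])             # assumed step penalty, assumed goal reward
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])  # Q(s', .) for each transition
dones = np.array([0.0, 1.0])                 # second transition reaches the platform
targets = td_targets(rewards, next_q, dones, gamma=0.5)
action = epsilon_greedy(np.array([0.1, 0.9, 0.3]), epsilon=0.1, rng=rng)
```

In a full agent, `q_taken` would be the network's Q-value for the action actually taken, and the loss would be minimized by gradient descent on the network parameters.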
### Results and Analysis
The experimental results show that the agent can learn effective navigation strategies in a relatively short time, and that it performs particularly well with longer input sequences. There are exceptions, however: in some cases the agent falls into repeated oscillating movements, indicating that the model configuration may need further optimization.
### Conclusion
This study demonstrates the potential of Transformer-based deep reinforcement learning in partially observable environments, especially for simulating the spatial navigation tasks of organisms. Future research can further explore how to optimize the model configuration and how to use Explainable Artificial Intelligence (XAI) techniques to better understand the internal representation and decision-making process of the agent.
### Formula Summary
- **Self-Attention Mechanism**:
\[
\text{Attention}(Q, K, V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
\]
- **Multi-Head Attention Mechanism**:
\[
\text{MultiHead}(Q, K, V)=\text{Concat}(\text{head}_{1},\dots,\text{head}_{h})W_{O}
\]
where the calculation formula for each head is:
\[
\text{head}_{i}=\text{Attention}(QW_{Q}^{i}, KW_{K}^{i}, VW_{V}^{i})
\]
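The two attention formulas above translate almost line by line into NumPy. The sketch below is illustrative only: the weights are random, the sizes (5 time steps, model width 16, 8 heads) are assumptions chosen for the example, and the optional `causal` mask is an assumption added because a decoder-only model normally prevents a position from attending to later ones.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Block attention to future positions (strictly upper triangle).
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def multi_head(X, W_Q, W_K, W_V, W_O, causal=False):
    """Self-attention with Q = K = V = X: one attention call per head, then concat and project."""
    heads = [attention(X @ wq, X @ wk, X @ wv, causal)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Example with assumed sizes: 5 past observations, width 16, 8 heads of size 2.
rng = np.random.default_rng(0)
T, d_model, h = 5, 16, 8
d_k = d_model // h
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head(X, W_Q, W_K, W_V, W_O, causal=True)  # shape (T, d_model)
```

With `causal=True`, the output at the first position depends only on the first observation, which is why the same function can be applied to input sequences of different lengths.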
- **Q - l