Abstract: The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it is chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps. The brute-force state-augmentation baseline, where the state is concatenated with the last $m$ committed actions, suffers from exponential complexity in $m$, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. Consequently, we devise a non-stationary Q-learning-style model-based algorithm that solves delayed-execution tasks without resorting to state augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state augmentation struggle or fail due to divergence. The code is available at http://github.com/galdl/rl_delay_basic and http://github.com/galdl/rl_delay_atari.
What problem does this paper attempt to address?
This paper addresses decision-making in environments with execution delay. The standard Markov Decision Process (MDP) assumes that an action is executed immediately once it is selected. In many practical settings, such as robotic manipulation, cloud computing, and financial trading, this assumption is unrealistic, and relying on it may lead to catastrophic failures.
### Core Problems of the Paper
1. **Impact of Execution Delay**:
- The standard MDP assumes that actions are executed instantaneously, but real systems often introduce delays. In an autonomous vehicle, for example, there may be a delay between the perception module's inference and the execution of the resulting action.
2. **Limitations of Existing Methods**:
- Delays are traditionally handled by state augmentation, that is, by appending the last $m$ committed actions to the current state. The augmented state space grows exponentially in the delay $m$, so the computational cost rises sharply as the delay increases.
3. **Necessity of Non-stationary Markov Policies**:
- The authors prove that, in the presence of execution delay, deterministic non-stationary Markov policies in the original state space are sufficient to achieve optimal performance, while stationary Markov policies are sub-optimal in general.
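The delay mechanism and the cost of state augmentation described above can be made concrete with a small sketch. Everything here is illustrative and not taken from the paper's code: `DelayedChain` is a hypothetical toy environment, `default_action` is an assumed way to fill the action queue before the first committed action arrives, and `augmented_state_count` simply evaluates $|S| \cdot |A|^m$, the number of (state, pending-actions) pairs.

```python
from collections import deque


def augmented_state_count(num_states: int, num_actions: int, delay: int) -> int:
    """Number of augmented states: one original state plus `delay` pending actions."""
    return num_states * num_actions ** delay


class DelayedChain:
    """Hypothetical toy chain with states 0..n_states-1.

    Action 1 moves right, any other action moves left. An action committed
    now is only executed `delay` steps later.
    """

    def __init__(self, n_states: int = 5, delay: int = 2, default_action: int = 0):
        self.n_states = n_states
        self.state = 0
        # Assumption: the queue starts filled with a default action.
        self.pending = deque([default_action] * delay)

    def step(self, action: int):
        """Commit `action`, execute the oldest pending one, return (state, executed)."""
        self.pending.append(action)
        executed = self.pending.popleft()
        move = 1 if executed == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        return self.state, executed


if __name__ == "__main__":
    for m in (0, 5, 10):
        print(m, augmented_state_count(100, 4, m))
```

With 100 states and 4 actions, a delay of 10 already yields 104,857,600 augmented states, which is why a tabular method over the augmented space becomes impractical for large $m$.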
### Main Contributions of the Paper
1. **Theoretical Analysis**:
- Proposes an MDP framework that includes execution delay and analyzes the impact of the delay on policy performance.
- Gives upper and lower bounds on the complexity of policy iteration in the delayed MDP, showing that the state-augmentation approach is exponential in the delay $m$.
2. **New Algorithm**:
- Designs a model-based, Q-learning-style algorithm (Delayed-Q) that handles tasks with execution delay without relying on state augmentation. In experiments on tabular, physical, and Atari domains, the algorithm converges quickly to high performance, even for substantial delays.
3. **Experimental Verification**:
- Verifies the algorithm through extensive experiments, including different types of delay and noisy environments, and shows its advantages over standard approaches that either ignore the delay or rely on state augmentation.
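One plausible way a model-based method can act under delay without augmenting the state is to roll a forward model through the actions that are still pending and then pick the action for the state in which it will actually be executed. The sketch below illustrates only that idea and is not the authors' implementation (see the linked repositories for that). It assumes a known, deterministic tabular model `transition[state][action]` and a tabular `q_table`; both names, and the helper functions, are hypothetical.

```python
def predict_execution_state(state, pending_actions, transition):
    """Roll an assumed deterministic model forward through the queued actions."""
    for action in pending_actions:
        state = transition[state][action]
    return state


def select_delayed_action(q_table, state, pending_actions, transition):
    """Greedy action for the state in which the new action is expected to execute."""
    future_state = predict_execution_state(state, pending_actions, transition)
    q_values = q_table[future_state]
    # Index of the largest Q-value (first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    # Three-state chain: action 0 moves left, action 1 moves right.
    transition = [[0, 1], [0, 2], [1, 2]]
    q_table = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    # Two "right" actions are pending, so the new action executes in state 2.
    print(select_delayed_action(q_table, 0, [1, 1], transition))
```

The Q-table here is indexed by the original state only, so its size does not depend on the delay. In a stochastic environment the predicted state would be an estimate rather than an exact value, which this sketch does not handle.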
### Summary
This paper addresses effective decision-making in environments with execution delay. It proposes a new theoretical framework and algorithm that avoid the exponential computational cost of state augmentation, and it shows that non-stationary Markov policies are needed to act optimally under delay. The experimental results further support the effectiveness and robustness of the new method.