Abstract: The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it is chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps. The brute-force state augmentation baseline, where the state is concatenated to the last $m$ committed actions, suffers from exponential complexity in $m$, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. Consequently, we devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state-augmentation struggle or fail due to divergence. The code is available at http://github.com/galdl/rl_delay_basic and http://github.com/galdl/rl_delay_atari.
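The delayed-execution setting described in the abstract can be illustrated with a small queue-based wrapper. This is a minimal sketch, not the authors' implementation: the class name `DelayedExecution`, the toy `chain_step` dynamics, and the choice to pre-fill the queue with `delay` initial actions are all assumptions made for illustration.

```python
from collections import deque


class DelayedExecution:
    """Hypothetical wrapper: each committed action is executed `delay` steps later."""

    def __init__(self, step_fn, initial_state, delay, initial_actions):
        # Assumption: the first `delay` executed actions are supplied up front.
        if len(initial_actions) != delay:
            raise ValueError("need exactly `delay` initial pending actions")
        self.step_fn = step_fn
        self.state = initial_state
        self.pending = deque(initial_actions)  # oldest committed action first

    def step(self, action):
        # Commit the new action, then execute the oldest pending one.
        self.pending.append(action)
        executed = self.pending.popleft()
        self.state, reward = self.step_fn(self.state, executed)
        return self.state, reward, executed


def chain_step(state, action):
    """Toy dynamics (an assumption): integer chain, reward 1 on reaching state 3."""
    next_state = state + action
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


env = DelayedExecution(chain_step, initial_state=0, delay=2, initial_actions=[0, 0])
# The agent commits +1 at every step, but the first two executed actions
# are the pre-filled no-ops, so the state only starts moving at step 3.
trace = [env.step(a) for a in (1, 1, 1, 1, 1)]
for state, reward, executed in trace:
    print(state, reward, executed)
```

With `delay=0` and an empty `initial_actions` list, the wrapper reduces to the standard MDP interaction in which the chosen action is executed immediately.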

What problem does this paper attempt to address?

This paper addresses decision-making in environments with execution delay. The standard Markov Decision Process (MDP) assumes that an action is executed immediately once it is selected. In many practical settings, such as robotic manipulation, cloud computing, and finance, this assumption is unrealistic and can lead to catastrophic failures.

### Core Problems of the Paper

1. **Impact of Execution Delay**:
   - The standard MDP assumes that actions are executed instantaneously, but real systems may introduce delays. For example, in an autonomous vehicle, there may be a delay between the perception module's inference and the execution of an action.
2. **Limitations of Existing Methods**:
   - Delays are traditionally handled through state augmentation, that is, appending the last $m$ committed actions to the current state. The augmented state space grows exponentially in the delay $m$, which greatly increases computational complexity, especially when the delay is large.
3. **Necessity of Non-stationary Markov Policies**:
   - The authors prove that, in the presence of execution delay, deterministic non-stationary Markov policies in the original state space are sufficient to attain maximal reward, while stationary Markov policies are sub-optimal in general.

### Main Contributions of the Paper

1. **Theoretical Analysis**:
   - Proposes an MDP framework that includes execution delay and analyzes the impact of delay on policy performance.
   - Gives upper and lower bounds on the complexity of policy iteration in MDPs with delay.
2. **New Algorithm**:
   - Designs a non-stationary, model-based, Q-learning style algorithm (Delayed-Q) that handles tasks with execution delay without relying on state augmentation.
   - Experimental results show that this algorithm performs well in a variety of environments (tabular, physical, and Atari domains) and converges quickly to high performance even for large delays.
3. **Experimental Verification**:
   - Verifies the effectiveness of the new algorithm through extensive experiments, including different types of delay and noisy environments, and shows its advantages over approaches that ignore the delay or rely on state augmentation.

### Summary

This paper studies how to make effective decisions in an environment with execution delay. It proposes a new theoretical framework and algorithm, avoids the computational complexity caused by state augmentation, and proves the importance of non-stationary Markov policies in a delayed environment. Experimental results further support the effectiveness and robustness of the new method.
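To make the exponential cost of state augmentation concrete, the sketch below counts and enumerates augmented states, where each augmented state pairs an original state with the last $m$ committed actions, giving $|S| \cdot |A|^m$ states in total. The function names and the example sizes are hypothetical and chosen only for illustration.

```python
from itertools import product


def augmented_state_count(num_states, num_actions, delay):
    """Number of augmented states: |S| * |A|**m."""
    return num_states * num_actions ** delay


def enumerate_augmented_states(states, actions, delay):
    """List every (state, pending_actions) pair; feasible only for tiny problems."""
    return [
        (s, pending)
        for s in states
        for pending in product(actions, repeat=delay)
    ]


# Illustrative sizes (assumed): 10 states and 4 actions.
for m in (0, 1, 5, 10):
    print(m, augmented_state_count(10, 4, m))

# Tiny case enumerated explicitly: 3 states, 2 actions, delay 3.
small = enumerate_augmented_states(range(3), range(2), 3)
print(len(small))
```

Each additional step of delay multiplies the table size by the number of actions, which is why any method that iterates over the augmented state space becomes impractical for substantial delays.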

Acting in Delayed Environments with Non-Stationary Markov Policies
