Abstract: Although reinforcement learning has become very popular in recent years, the number of successful applications to different kinds of operations research problems is rather small. Reinforcement learning is based on the well-studied dynamic programming technique and thus also aims at finding the best stationary policy for a given Markov Decision Process, but in contrast to dynamic programming it does not require any model knowledge. The policy is assessed solely on consecutive states (or state-action pairs), which are observed while an agent explores the solution space. The contributions of this paper are manifold. First, we provide deep theoretical insights into the widely applied standard discounted reinforcement learning framework, which explain why these algorithms are inappropriate when permanently provided with non-zero rewards, such as costs or profit. Second, we establish a novel near-Blackwell-optimal reinforcement learning algorithm. In contrast to former methods, it assesses the average reward per step separately and thus prevents the incautious combination of different types of state values. The Laurent series expansion of the discounted state values forms the foundation for this development and also provides the connection between the two approaches. Finally, we prove the viability of our algorithm on a challenging problem set, which includes a well-studied M/M/1 admission control queuing system. In contrast to standard discounted reinforcement learning, our algorithm infers the optimal policy on all tested problems. The key insight is that machine learning techniques have to be adapted and advanced before they can be applied successfully in the operations research domain.
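The role of the Laurent series expansion can be illustrated numerically. The following is a minimal sketch, not the algorithm proposed in the paper: it assumes a hypothetical two-state Markov chain under a fixed policy (the transition probabilities `a`, `b` and rewards `r` are made up for illustration) and compares the exact discounted state values with the two leading terms of the expansion, the average reward per step divided by `1 - gamma` plus the bias value. As `gamma` approaches 1, the average-reward term dominates the discounted value, which suggests why permanently non-zero rewards can mask the differences between states.

```python
def discounted_values(a, b, r, gamma):
    """Solve (I - gamma * P) V = r for P = [[1-a, a], [b, 1-b]] via Cramer's rule."""
    m00, m01 = 1 - gamma * (1 - a), -gamma * a
    m10, m11 = -gamma * b, 1 - gamma * (1 - b)
    det = m00 * m11 - m01 * m10
    return ((m11 * r[0] - m01 * r[1]) / det,
            (m00 * r[1] - m10 * r[0]) / det)


def gain_and_bias(a, b, r):
    """Average reward per step (gain) and bias values of the two-state chain."""
    pi0, pi1 = b / (a + b), a / (a + b)      # stationary distribution
    rho = pi0 * r[0] + pi1 * r[1]            # average reward per step
    d = (r[0] - rho) / a                     # equals bias[0] - bias[1]
    return rho, (pi1 * d, -pi0 * d)          # normalized so that pi . bias = 0


# Hypothetical example values (assumptions, not taken from the paper).
a, b, r = 0.3, 0.2, (1.0, 5.0)
rho, bias = gain_and_bias(a, b, r)

for gamma in (0.9, 0.99, 0.999):
    v = discounted_values(a, b, r, gamma)
    # Two leading terms of the Laurent expansion for state 0.
    approx = rho / (1 - gamma) + bias[0]
    print(f"gamma={gamma}: V(0)={v[0]:.4f}, "
          f"rho/(1-gamma)+bias(0)={approx:.4f}, "
          f"remainder={v[0] - approx:.4f}")
```

In this sketch the remainder shrinks as `gamma` grows, while the term `rho / (1 - gamma)` becomes orders of magnitude larger than the bias values that actually distinguish the states.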

Average Optimality for Finite Models

On Average Optimality for Non-Stationary Markov Decision Processes in Borel Spaces

Average Optimality in Markov Decision Processes with Unbounded Rewards

Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

Average-Cost MDPs with Infinite State and Action Sets: New Sufficient Conditions for Optimality Inequalities and Equations

Finding Optimal Observation-Based Policies for Constrained POMDPs under the Expected Average Reward Criterion

Finding Optimal Memoryless Policies of POMDPs under the Expected Average Reward Criterion

Finitely additive behavioral strategies: when do they induce an unambiguous expected payoff?

Beyond Average Return in Markov Decision Processes

Beyond discounted returns: Robust Markov decision processes with average and Blackwell optimality

Optimal Sample Complexity for Average Reward Markov Decision Processes

Average-Reward Reinforcement Learning with Trust Region Methods

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences

Optimal models with maximizing probability of first achieving target value in the preceding stages

On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Necessary Optimality Conditions For Average Cost Minimization Problems

Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications

Optimal Model Averaging Estimation for Generalized Linear Models and Generalized Linear Mixed-Effects Models

Solution to the risk-sensitive average cost optimality equation in a class of Markov decision processes with finite state space

Off-Policy Average Reward Actor-Critic with Deterministic Policy Search