Abstract: Although reinforcement learning has become very popular in recent years, successful applications to operations research problems remain scarce. Reinforcement learning is based on the well-studied dynamic programming technique and thus also aims at finding the best stationary policy for a given Markov decision process, but in contrast to dynamic programming it does not require any model knowledge. The policy is assessed solely on consecutive states (or state-action pairs), which are observed while an agent explores the solution space. The contributions of this paper are manifold. First, we provide deep theoretical insights into the widely applied standard discounted reinforcement learning framework, which explain why these algorithms are inappropriate when permanently provided with non-zero rewards, such as costs or profits. Second, we establish a novel near-Blackwell-optimal reinforcement learning algorithm. In contrast to former methods, it assesses the average reward per step separately and thus prevents the incautious combination of different types of state values. The Laurent series expansion of the discounted state values forms the foundation for this development and also provides the connection between the two approaches. Finally, we demonstrate the viability of our algorithm on a challenging problem set, which includes a well-studied M/M/1 admission control queuing system. In contrast to standard discounted reinforcement learning, our algorithm infers the optimal policy on all tested problems. The insight is that, in the operations research domain, machine learning techniques have to be adapted and advanced before they can be applied successfully.
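The role of the Laurent series expansion can be illustrated with a small numerical sketch. It relies on the standard textbook form of the expansion for a fixed policy, in which the discounted value of a state equals the average reward divided by (1 - gamma), plus a bias value, plus an error term that vanishes as gamma approaches 1. The two-state chain, its rewards, and all function names below are hypothetical choices made for illustration and are not taken from the paper, whose notation and algorithm may differ.

```python
# Minimal sketch (illustrative assumptions only): a fixed policy on a
# two-state Markov chain with permanently non-zero rewards.

def discounted_values(P, r, gamma):
    """Solve (I - gamma * P) v = r for a 2-state chain via Cramer's rule."""
    a = 1 - gamma * P[0][0]
    b = -gamma * P[0][1]
    c = -gamma * P[1][0]
    d = 1 - gamma * P[1][1]
    det = a * d - b * c
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def average_reward(P, r):
    """Average reward per step under the stationary distribution."""
    p01, p10 = P[0][1], P[1][0]
    pi0 = p10 / (p01 + p10)  # stationary probability of state 0
    return pi0 * r[0] + (1 - pi0) * r[1]


def adjusted_values(P, r, gamma):
    """Discounted values with the average-reward term rho / (1 - gamma) removed."""
    rho = average_reward(P, r)
    return [v - rho / (1 - gamma) for v in discounted_values(P, r, gamma)]


P = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical transition matrix
r = [1.0, 3.0]                # hypothetical per-step rewards

for gamma in (0.9, 0.99, 0.999):
    print(gamma, discounted_values(P, r, gamma), adjusted_values(P, r, gamma))
```

In this toy chain the raw discounted values grow without bound as gamma approaches 1, because the average-reward term dominates them, while the adjusted values settle near the bias values of roughly -5/9 and 25/9. This is one way to see why assessing the average reward per step separately keeps the remaining state values on a comparable scale.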

A Class of Optimal Control Problem for Stochastic Discrete-Time Systems with Average Reward Reinforcement Learning

Optimal Control for Constrained Discrete-Time Nonlinear Systems Based on Safe Reinforcement Learning

Stochastic Optimal Control of Quasi Non-Integrable Hamiltonian Systems with Stochastic Maximum Principle

Average Cost Optimal Control of Stochastic Systems Using Reinforcement Learning

Study on an Average Reward Reinforcement Learning Algorithm

Near Optimal Control for a Class of Stochastic Hybrid Systems

Optimal Control of Ergodic Continuous-Time Markov Chains with Average Sample-Path Rewards

Long Run Stochastic Control Problems with General Discounting

Reinforcement Learning for Adaptive Optimal Stationary Control of Linear Stochastic Systems

Adaptive Optimal Control of Discrete-Time Linear Systems with Discounted Value: Off-Policy Reinforcement Learning

Hybrid Reinforcement Learning for Optimal Control of Non-Linear Switching System

Learning Optimal Control Policy for Unknown Discrete-Time Systems

Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications

Control in Stochastic Environment with Delays: A Model-based Reinforcement Learning Approach

Optimal Tracking Control for Non-Zero-sum Games of Linear Discrete-Time Systems Via Off-Policy Reinforcement Learning

NN Reinforcement Learning Adaptive Control for a Class of Nonstrict-Feedback Discrete-Time Systems

Average Reward Reinforcement Learning For Semi-Markov Decision Processes

Average Optimality For Continuous-Time Markov Decision Processes In Polish Spaces

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Randomized Optimal Stopping Problem in Continuous time and Reinforcement Learning Algorithm

Fuzzy-Based Adaptive Optimization of Unknown Discrete-Time Nonlinear Markov Jump Systems With Off-Policy Reinforcement Learning