Study on an Average Reward Reinforcement Learning Algorithm

GAO Yang, ZHOU Ru-Yi, WANG Hao, CAO Zhi-Xin
DOI: https://doi.org/10.3321/j.issn:0254-4164.2007.08.019
2007-01-01
Chinese Journal of Computers
Abstract: Many sequential decision-making problems are modeled as Markov decision processes (MDPs). Problems in which the system has sojourn times can often be modeled as semi-Markov decision processes (SMDPs). When the system's parameters are unknown in advance, reinforcement learning is used to obtain optimal policies. In this paper, an approximation theorem for average reward reinforcement learning is proven by means of the theory of performance potentials. A novel average reward reinforcement learning algorithm, G-learning, is designed by approximating the value function of performance potentials. G-learning can be applied not only to MDPs but also to SMDPs. Unlike the classical R-learning algorithm, G-learning uses the potential value of a reference state instead of the average performance of the system. G-learning is tested on an access-control queuing task and a production inventory task, and the experimental results show that it achieves better learning performance than R-learning and SMART.
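The contrast drawn in the abstract can be illustrated with a minimal tabular sketch for the MDP case. The first function is the textbook R-learning update, which maintains a separate estimate `rho` of the average reward. The second function is not the paper's G-learning rule, which the abstract does not state; it is a hypothetical illustration of the stated idea, in which the value at a fixed reference state takes the place of `rho`. All names, step sizes, and the choice of `ref_state` are assumptions made for this example.

```python
def r_learning_update(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    """One textbook R-learning step on a tabular Q (list of lists).

    Returns the updated average-reward estimate rho.
    """
    # Relative-value update: the reward is measured against rho.
    Q[s][a] += alpha * (r - rho + max(Q[s_next]) - Q[s][a])
    # rho is adjusted only when the action taken is greedy in s.
    if Q[s][a] == max(Q[s]):
        rho += beta * (r - rho + max(Q[s_next]) - max(Q[s]))
    return rho


def reference_state_update(Q, s, a, r, s_next, ref_state=0, alpha=0.1):
    """Hypothetical reference-state variant (an assumption, not the paper's rule).

    The value of a fixed reference state replaces the separate
    average-reward estimate, so no rho has to be tracked.
    """
    offset = max(Q[ref_state])  # stands in for rho
    Q[s][a] += alpha * (r - offset + max(Q[s_next]) - Q[s][a])


# Usage on a toy problem with 2 states and 2 actions.
Q_r = [[0.0, 0.0], [0.0, 0.0]]
rho = r_learning_update(Q_r, 0.0, s=0, a=0, r=1.0, s_next=1)

Q_ref = [[0.0, 0.0], [0.0, 0.0]]
reference_state_update(Q_ref, s=0, a=0, r=1.0, s_next=1)
```

The practical difference in this sketch is that the reference-state variant has one fewer quantity and one fewer step size to tune. Whether that accounts for the performance gain reported in the abstract is not something this example can show.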