Reinforcement Learning from Demonstration and Human Reward

Guangliang Li
2016-01-01
Abstract: In this paper, we propose IRL-TAMER, a model-based method that combines learning from demonstration via inverse reinforcement learning (IRL) with learning from human reward via the TAMER framework. We tested our method in the Grid World domain and compared it with the TAMER framework using different discount factors on human reward. Our results suggest that, with a single demonstration, an agent learning via IRL cannot obtain an effective policy for navigating to the goal state, but it can still learn a useful value function that indicates which states are good according to the demonstration. More importantly, learning from demonstration can reduce the number of human rewards needed to obtain an optimal policy, especially the number of negative feedback signals. That is, learning from demonstration can give the agent a jump-start when learning from human reward and reduce the number of mistakes (incorrect actions). Furthermore, our results show that learning from demonstration is useful for the agent's learning from human reward only when the discount rate is high, i.e., when the agent learns myopically from human reward.
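The contrast between myopic and far-sighted learning from human reward can be illustrated with a minimal sketch. It is not taken from the paper: the function name `discounted_return`, the feedback sequence, and the two values of `gamma` are hypothetical, and it assumes the common convention that a discount factor near 0 corresponds to a high discount rate (myopic learning).

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of human reward signals."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical human feedback along one trajectory:
# +1.0 for approval, -1.0 for disapproval (assumed encoding).
human_rewards = [1.0, -1.0, 1.0, 1.0]

# gamma = 0.0: only the immediate feedback counts (myopic).
myopic = discounted_return(human_rewards, gamma=0.0)

# gamma = 0.9: later feedback also contributes (far-sighted).
far_sighted = discounted_return(human_rewards, gamma=0.9)

print(myopic, far_sighted)
```

With `gamma=0.0` the return reduces to the first feedback signal alone, which is the sense in which a myopic agent treats human reward as an immediate judgment of its most recent action.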