Abstract: In reinforcement learning, a reward function is an a priori specified mapping that tells the learning agent how good its current states and actions are. From the viewpoint of training, reinforcement learning requires no labeled data and avoids the errors induced in supervised learning, because responsibility is transferred from the loss function to the reward function. Methods that infer an approximate reward function from observed demonstrations are termed inverse reinforcement learning or apprenticeship learning; they generate a reward function that reproduces the observed behaviors. Previous studies estimate the reward function using maximum likelihood, Bayesian, or information-theoretic methods. This study proposes an inverse reinforcement learning method in which the approximated reward function is a linear combination of feature expectations, each of which plays the role of a base weak classifier. The agent uses this approximated reward function to learn a policy, and the resulting behaviors are compared with an expert demonstration. The difference between the behaviors of the agent and those of the expert is measured using defined metrics, and the parameters of the approximated reward function are adjusted using an ensemble fuzzy method with boosting classification. After several interleaved iterations, the agent performs similarly to the expert demonstration. A fuzzy method assigns credit for the reward of the most recent decision to the neighboring states. Using the proposed method, the agent approximates the expert behaviors in fewer steps. Simulation results demonstrate that the proposed method performs well in terms of sampling efficiency.
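
The loop described in the abstract (a linear reward over features, compared against expert behavior, then a weight adjustment) can be illustrated with a minimal sketch. This is not the ensemble fuzzy boosting update of the paper, whose details the abstract does not give. It substitutes a plain feature-expectation-matching step in the style of apprenticeship learning. The names `feature_expectations`, `linear_reward`, `update_weights`, `one_hot`, the discount `gamma`, the step size `lr`, and the toy trajectories are all assumptions made for illustration.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Discounted sum of feature vectors, averaged over state trajectories."""
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

def linear_reward(w, phi, s):
    """Approximated reward: a linear combination of state features."""
    return float(np.dot(w, phi(s)))

def update_weights(w, mu_expert, mu_agent, lr=0.5):
    """Assumed stand-in update: move w toward the expert/agent feature gap.

    The paper adjusts the weights with an ensemble fuzzy boosting method;
    this simple gradient-like step is only a placeholder for that stage.
    """
    w_new = w + lr * (mu_expert - mu_agent)
    norm = np.linalg.norm(w_new)
    return w_new / norm if norm > 0 else w_new

def one_hot(s, n=3):
    """Hypothetical feature map: one-hot encoding of a discrete state."""
    v = np.zeros(n)
    v[s] = 1.0
    return v

# Toy data (hypothetical): the expert moves to state 2, the agent stays in state 0.
expert_trajs = [[0, 1, 2, 2]]
agent_trajs = [[0, 0, 0, 0]]

mu_e = feature_expectations(expert_trajs, one_hot)
mu_a = feature_expectations(agent_trajs, one_hot)
w = update_weights(np.zeros(3), mu_e, mu_a)

# After one update, states the expert visits more often receive higher reward.
print(linear_reward(w, one_hot, 2) > linear_reward(w, one_hot, 0))
```

In a full method, the agent would retrain its policy under the updated reward, generate new trajectories, and repeat the comparison until its feature expectations are close to the expert's.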

Hyperbolically-Discounted Reinforcement Learning on Reward-Punishment Framework

Reinforcement Learning with Quasi-Hyperbolic Discounting

Hyperbolic Deep Reinforcement Learning

Self Punishment and Reward Backfill for Deep Q-Learning

A bio-inspired reinforcement learning model that accounts for fast adaptation after punishment

An Ensemble Fuzzy Approach for Inverse Reinforcement Learning

Hierarchical Average Reward Policy Gradient Algorithms

A Dynamic Adjusting Reward Function Method for Deep Reinforcement Learning with Adjustable Parameters

Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning

Adaptive Discount Factor for Deep Reinforcement Learning in Continuing Tasks with Uncertainty

Off-Policy Reinforcement Learning with Delayed Rewards

Reward-Punishment Reinforcement Learning with Maximum Entropy

DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning

Reinforcement Learning from Demonstration and Human Reward

Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization

Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

Intentionally-underestimated Value Function at Terminal State for Temporal-difference Learning with Mis-designed Reward

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Learning Fair Policies in Multi-Objective (Deep) Reinforcement Learning with Average and Discounted Rewards