Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures

Adrien Bolland, Gaspard Lambrechts, Damien Ernst
2024-12-10
Abstract: We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process that shall be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features from these states and actions) visited during the next time steps. We first prove that an optimal exploration policy, which maximizes the expected discounted sum of intrinsic rewards, is also a policy that maximizes a lower bound on the state-action value function of the decision process under some assumptions. We also prove that the visitation distribution used in the intrinsic reward definition is the fixed point of a contraction operator. Following, we describe how to adapt existing algorithms to learn this fixed point and compute the intrinsic rewards to enhance exploration. A new practical off-policy maximum entropy reinforcement learning algorithm is finally introduced. Empirically, exploration policies have good state-action space coverage, and high-performing control policies are computed efficiently.
Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is exploration in reinforcement learning (RL): how to explore the state-action space effectively in order to improve policy performance. Specifically, the authors introduce a new maximum entropy reinforcement learning (MaxEntRL) framework that enhances exploration through future state and action visitation measures.

### Main Problems

1. **Limitations of Existing Methods**:
   - Existing maximum entropy reinforcement learning methods usually reward only the randomness of actions and ignore the influence of the policy on the states it visits, which may lead to inefficient exploration.
   - Many existing algorithms are on-policy and need to sample new trajectories from the environment every time the policy is updated. They cannot use buffers of arbitrary transitions or be applied in batch-mode RL.
2. **Balance between Exploration and Control Policies**:
   - How to design an algorithm that both explores the state-action space efficiently and computes high-performing control policies.

### Solutions

The authors propose a new MaxEntRL framework whose core innovations include:

- **Intrinsic Reward Function**: For each state and action, the intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features extracted from them) visited in future time steps.
- **Proof of Fixed Point**: The visitation distribution used to define the intrinsic reward is proved to be the fixed point of a contraction operator, so this distribution can be estimated off-policy.
- **Adaptation to Existing Algorithms**: The paper describes how to adapt existing RL algorithms to learn this fixed point and compute the intrinsic rewards to enhance exploration.

### Experimental Verification

The authors verify the effectiveness of the new method through experiments, in particular its exploration ability in sparse-reward environments. The results show that the exploration policies computed with the new method have good state-action space coverage and that high-performing control policies are computed efficiently.

### Summary

The main contribution of this paper is a new MaxEntRL framework that enhances exploration through future state and action visitation measures and addresses the limitations of existing methods in exploration efficiency and applicability.
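
The fixed-point property mentioned under Solutions can be illustrated on a small tabular problem. The following is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a finite MDP with known transition tensor `P` and policy `pi`, an operator of the form T(d) = (1 - γ) M + γ M d (where M is the one-step state-action transition matrix under the policy), and an intrinsic reward equal to the negative KL divergence to a uniform target. The function names, the operator form, and the uniform target are all illustrative choices that do not come from the text above.

```python
import numpy as np


def transition_matrix(P, pi):
    """M[(s,a),(s',a')] = P[s,a,s'] * pi[s',a'], flattened to (S*A, S*A)."""
    S, A = pi.shape
    return (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)


def visitation_operator(d, M, gamma):
    """Assumed operator T(d) = (1 - gamma) * M + gamma * M @ d.

    Row (s,a) of d is a distribution over future state-action pairs.
    Because M is row-stochastic, T is a gamma-contraction in the sup norm.
    """
    return (1.0 - gamma) * M + gamma * M @ d


def fixed_point(P, pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the operator from a uniform guess until successive iterates agree."""
    M = transition_matrix(P, pi)
    n = M.shape[0]
    d = np.full((n, n), 1.0 / n)
    for _ in range(max_iter):
        d_next = visitation_operator(d, M, gamma)
        if np.max(np.abs(d_next - d)) < tol:
            return d_next
        d = d_next
    return d


def intrinsic_reward(d, target=None, eps=1e-12):
    """Negative KL(d(.|s,a) || target) per state-action pair.

    The uniform default target is an assumption made for this sketch.
    """
    n = d.shape[1]
    q = np.full(n, 1.0 / n) if target is None else np.asarray(target)
    kl = np.sum(d * (np.log(d + eps) - np.log(q + eps)), axis=1)
    return -kl


# Hypothetical random MDP with 4 states and 2 actions.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
pi = rng.dirichlet(np.ones(A), size=S)       # shape (S, A)

d = fixed_point(P, pi, gamma)
r_int = intrinsic_reward(d).reshape(S, A)
print(r_int)
```

In this sketch, state-action pairs whose future visitation is spread more evenly over the state-action space receive an intrinsic reward closer to zero, while pairs that lead to a narrow set of future states and actions are penalized more. In a practical setting the transition model is unknown, so the distribution would have to be learned from sampled transitions rather than computed exactly as it is here.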