Abstract: We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process to be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features from these states and actions) visited during the next time steps. We first prove that, under some assumptions, an optimal exploration policy, which maximizes the expected discounted sum of intrinsic rewards, also maximizes a lower bound on the state-action value function of the decision process. We also prove that the visitation distribution used in the intrinsic reward definition is the fixed point of a contraction operator. We then describe how to adapt existing algorithms to learn this fixed point and compute the intrinsic rewards to enhance exploration. Finally, we introduce a new practical off-policy maximum entropy reinforcement learning algorithm. Empirically, exploration policies have good state-action space coverage, and high-performing control policies are computed efficiently.

What problem does this paper attempt to address?

This paper addresses the exploration problem in reinforcement learning (RL): how to explore the state-action space effectively so that the resulting policies perform better. Specifically, it introduces a new maximum entropy reinforcement learning (MaxEntRL) framework that enhances exploration through future state and action visitation measures.

### Main Problems

1. **Limitations of Existing Methods**:
   - Existing maximum entropy reinforcement learning methods usually reward only the randomness of actions and ignore the influence of the policy on the states it visits, which may lead to inefficient exploration.
   - Many existing algorithms are on-policy and need to sample new trajectories from the environment every time the policy is updated. They cannot use buffers of arbitrary transitions or be applied in batch-mode RL.
2. **Balance between Exploration and Control Policies**:
   - How to design an algorithm that both explores the state-action space efficiently and computes high-performing control policies.

### Solutions

The paper proposes a new MaxEntRL framework whose core innovations include:

- **Intrinsic Reward Function**: For each state and action, the intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features extracted from them) visited in future time steps.
- **Proof of Fixed Point**: The visitation distribution used to define the intrinsic reward is proved to be the fixed point of a contraction operator, so this distribution can be estimated offline.
- **Adaptation of Existing Algorithms**: The paper describes how to adapt existing RL algorithms to learn this fixed point and compute the intrinsic reward to enhance exploration.

### Experimental Verification

The paper verifies the effectiveness of the new method through experiments, especially its exploration ability in sparse-reward environments. The experimental results show that the exploration policies computed with the new method have good state-action space coverage and that high-performing control policies are computed efficiently.

### Summary

The main contribution of this paper is a new MaxEntRL framework that enhances exploration through future state and action visitation measures, addressing the limitations of existing methods in exploration efficiency and applicability.
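The fixed-point property described above can be illustrated on a small tabular problem. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes a finite MDP with a known transition tensor `P` and policy `pi`, defines the conditional discounted visitation of future state-action pairs as d(z | s, a) = (1 - γ) Σ_t γ^t Pr(z_{t+1} = z | s, a), and takes the intrinsic reward to be the negative relative entropy between d(· | s, a) and a uniform target `q`. The function names, the uniform target, and the sign convention are all illustrative choices made here.

```python
import numpy as np

def visitation_fixed_point(P, pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterate T(d) = (1 - gamma) * P_pi + gamma * P_pi @ d until convergence.

    P:  transitions, shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
    pi: policy, shape (S, A), pi[s, a] = Pr(a | s)
    Returns d of shape (S*A, S*A), where row (s, a) is d(. | s, a).
    """
    n_s, n_a, _ = P.shape
    # Pair-to-pair kernel: Pr((s', a') | (s, a)) = P[s, a, s'] * pi[s', a'].
    P_pi = np.einsum("sap,pb->sapb", P, pi).reshape(n_s * n_a, n_s * n_a)
    d = np.full_like(P_pi, 1.0 / (n_s * n_a))  # arbitrary starting point
    for _ in range(max_iter):
        d_next = (1 - gamma) * P_pi + gamma * P_pi @ d
        # T is a gamma-contraction in the sup norm because P_pi is row-stochastic.
        if np.max(np.abs(d_next - d)) < tol:
            return d_next, P_pi
        d = d_next
    return d, P_pi

def intrinsic_reward(d, q, eps=1e-12):
    """Negative KL(d(. | s, a) || q) for every pair (assumed reward form)."""
    return np.sum(d * (np.log(q + eps) - np.log(d + eps)), axis=1)

# Hypothetical random MDP with 4 states and 2 actions.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # shape (4, 2, 4)
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # shape (4, 2)

d, P_pi = visitation_fixed_point(P, pi, gamma)
q = np.full(n_s * n_a, 1.0 / (n_s * n_a))          # uniform target (assumption)
r_int = intrinsic_reward(d, q)
```

In this tabular setting the fixed point also has the closed form (1 - γ)(I - γ P_π)^{-1} P_π, which is a convenient check on the iteration. With function approximation no such closed form is available, which is where learning the fixed point from sampled transitions becomes relevant.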

Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures
