Abstract: In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. Since dynamic programming therefore no longer applies, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, even though the optimization problem is nonconvex. We also establish a convergence rate of order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that the scheme converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.
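
To make the objective concrete, the following is a minimal sketch of a gradient of a concave utility of the occupancy measure in a small tabular MDP with known transitions. It is not the sample-based saddle-point estimator described in the abstract. It only uses the chain rule: for a utility $F$ of the occupancy measure $\lambda(\theta)$, the gradient of $F(\lambda(\theta))$ equals the ordinary policy gradient computed with the fixed pseudo-reward $r = \nabla_\lambda F(\lambda(\theta))$. Everything named here is an assumption made for illustration: the softmax parametrization, the entropy utility (an exploration-style objective), the unnormalized discounted occupancy, and all function names. The entropy utility also assumes every state-action pair has positive occupancy.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax: pi[s, a] is the probability of action a in state s."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def occupancy(theta, P, xi, gamma):
    """Unnormalized discounted state-action occupancy lam[s, a].

    P has shape (S, A, S); xi is the initial state distribution.
    The entries of lam sum to 1 / (1 - gamma).
    """
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    d = np.linalg.solve(np.eye(len(xi)) - gamma * P_pi.T, xi)
    return d[:, None] * pi

def entropy_utility(lam, gamma):
    """Illustrative concave utility: entropy of the normalized occupancy."""
    p = (1.0 - gamma) * lam
    return -np.sum(p * np.log(p))

def entropy_utility_grad(lam, gamma):
    """Gradient of entropy_utility with respect to lam."""
    p = (1.0 - gamma) * lam
    return -(1.0 - gamma) * (np.log(p) + 1.0)

def value_gradient(theta, P, xi, gamma, r):
    """Exact softmax policy gradient of the cumulative reward <lam, r>."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * r, axis=1)
    v = np.linalg.solve(np.eye(len(xi)) - gamma * P_pi, r_pi)  # state values
    q = r + gamma * (P @ v)                                    # action values
    d = np.linalg.solve(np.eye(len(xi)) - gamma * P_pi.T, xi)
    return d[:, None] * pi * (q - v[:, None])

def general_utility_gradient(theta, P, xi, gamma):
    """Chain rule: policy gradient under the pseudo-reward r = dF/dlam."""
    lam = occupancy(theta, P, xi, gamma)
    r = entropy_utility_grad(lam, gamma)
    return value_gradient(theta, P, xi, gamma, r)
```

With a linear utility $F(\lambda) = \langle \lambda, r \rangle$ the pseudo-reward is just $r$, and the sketch reduces to the standard cumulative-reward policy gradient. For a general concave $F$, the pseudo-reward depends on the occupancy measure itself, which is not directly observable from sample paths.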

Approximation Benefits of Policy Gradient Methods with Aggregated States

Gradient $Q(\sigma, \lambda)$: A Unified Algorithm with Function Approximation for Reinforcement Learning

Approximate Newton policy gradient algorithms

Stochastic Cubic-Regularized Policy Gradient Method

Off-Policy Policy Gradient with State Distribution Correction

Policy Gradient Method For Robust Reinforcement Learning

Variance-Reduced Policy Gradient Approaches for Infinite Horizon Average Reward Markov Decision Processes

On the Convergence of Discounted Policy Gradient Methods

A Temporal-Difference Approach to Policy Gradient Estimation

Value-Gradient Iteration with Quadratic Approximate Value Functions

Compatible Gradient Approximations for Actor-Critic Algorithms

Policy Optimization over General State and Action Spaces

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

A Policy-Gradient Approach to Solving Imperfect-Information Games with Iterate Convergence

The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation

Policy Gradient Methods for Reinforcement Learning with Function Approximation and Action-Dependent Baselines

Hessian Aided Policy Gradient

On the Sample Complexity of a Policy Gradient Algorithm with Occupancy Approximation for General Utility Reinforcement Learning

Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets

Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies