Abstract: Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and then iteratively update the policies for each state. This prevents the application of many well-studied RL methods, especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish a linear convergence rate to global optimality or sublinear convergence to stationarity for these methods applied to solve different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, the development of these algorithmic frameworks, as well as their convergence analysis, appears to be new in the literature. Preliminary numerical results demonstrate the robustness of the aforementioned methods and show they can be competitive with state-of-the-art RL algorithms.

Taylor Expansion Policy Optimization

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Taylor TD-learning

Provably Efficient Exploration in Policy Optimization

Policy Optimization for Continuous Reinforcement Learning

Policy Optimization over General State and Action Spaces

Trust Region-Guided Proximal Policy Optimization

Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards

Policy Optimization with Model-based Explorations

Reflective Policy Optimization

Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes

An Analytical Update Rule for General Policy Optimization

OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

Towards Applicable Reinforcement Learning: Improving the Generalization and Sample Efficiency with Policy Ensemble

Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

Multi-Path Policy Optimization

FORESEE: Prediction with Expansion-Compression Unscented Transform for Online Policy Optimization

A Stochastic Trust-Region Framework for Policy Optimization

Offline Reinforcement Learning with Closed-Form Policy Improvement Operators

Fractal Landscapes in Policy Optimization