Abstract: It is well known that the extension of Watkins' algorithm to general function approximation settings is challenging: does the projected Bellman equation have a solution? If so, is the solution useful in the sense of generating a good policy? And, if the preceding questions are answered in the affirmative, is the algorithm consistent? These questions are unanswered even in the special case of Q-function approximations that are linear in the parameter. The challenge seems paradoxical, given the long history of convex analytic approaches to dynamic programming. The paper begins with a brief survey of linear programming approaches to optimal control, leading to a particular over-parameterization that lends itself to applications in reinforcement learning. The main conclusions are summarized as follows: (i) A new class of convex Q-learning algorithms is introduced, based on a convex relaxation of the Bellman equation. Convergence is established under general conditions, including a linear function approximation for the Q-function. (ii) A batch implementation appears similar to the famed DQN algorithm (one engine behind AlphaZero). It is shown that the algorithms are in fact very different: while convex Q-learning solves a convex program that approximates the Bellman equation, the theory for DQN is no stronger than that for Watkins' algorithm with function approximation: (a) it is shown that both seek solutions to the same fixed point equation, and (b) the ODE approximations for the two algorithms coincide, and little is known about the stability of this ODE. These results are obtained for deterministic nonlinear systems with a total cost criterion. Many extensions are proposed, including kernel implementations and an extension to MDP models.
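
To make the phrase "convex relaxation of the Bellman equation" concrete, the block below sketches the standard linear programming view of dynamic programming for a deterministic system. The notation is an assumption introduced here for illustration and does not appear in the abstract: state $x$, input $u$, dynamics $x^+ = F(x,u)$, nonnegative cost $c(x,u)$, and a nonnegative weighting measure $\mu$. It should be read as a sketch of the general idea, not as the paper's exact formulation.

```latex
% Assumed notation (illustrative only): dynamics x^+ = F(x,u), cost c(x,u) >= 0,
% weighting measure \mu >= 0 on state-input pairs.
\begin{align*}
\text{Bellman equation:}\quad
  & Q^\star(x,u) = c(x,u) + \min_{u'} Q^\star\bigl(F(x,u),u'\bigr) \\[4pt]
\text{Convex relaxation:}\quad
  & \max_{Q}\ \langle \mu, Q\rangle
    \quad\text{s.t.}\quad
    Q(x,u) \le c(x,u) + \min_{u'} Q\bigl(F(x,u),u'\bigr)
    \quad \forall (x,u) \\[4pt]
\text{Over-parameterized LP:}\quad
  & \max_{J,\,Q}\ \langle \mu, Q\rangle
    \quad\text{s.t.}\quad
    Q(x,u) \le c(x,u) + J\bigl(F(x,u)\bigr),\quad
    J(x) \le Q(x,u)
    \quad \forall (x,u)
\end{align*}
```

The relaxed constraint is convex in $Q$ because a pointwise minimum of linear functionals of $Q$ is concave. With a linear parameterization $Q^\theta = \theta^\top \psi$ (again an assumed notation), the constraint is therefore convex in $\theta$. Introducing the extra variable $J$ turns the program into one with only linear constraints.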

Gradient Q(σ, λ): A Unified Algorithm with Function Approximation for Reinforcement Learning

A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Safe Reinforcement Learning Using Finite-Horizon Gradient-based Estimation

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Constant Stepsize Q-learning: Distributional Convergence, Bias and Extrapolation

Multi-step Reinforcement Learning: A Unifying Algorithm

On Convergence of Gradient Expected Sarsa($λ$)

Provably Efficient Q-learning with Function Approximation Via Distribution Shift Error Checking Oracle

Finite-Time Error Bounds for Greedy-GQ

Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

Posterior Sampling for Competitive RL: Function Approximation and Partial Observation

Gaussian-Mixture-Model Q-Functions for Reinforcement Learning by Riemannian Optimization

Double Q($σ$) and Q($σ, λ$): Unifying Reinforcement Learning Control Algorithms

Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

Adaptive Order Q-learning

Double Successive Over-Relaxation Q-Learning with an Extension to Deep Reinforcement Learning

Regularized Q-Learning with Linear Function Approximation

Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation

Convex Q-Learning, Part 1: Deterministic Optimal Control

Deep Q-learning Sampling Based on Advantages