Abstract: It is well known that the extension of Watkins' algorithm to general function approximation settings is challenging: does the projected Bellman equation have a solution? If so, is the solution useful in the sense of generating a good policy? And, if the preceding questions are answered in the affirmative, is the algorithm consistent? These questions are unanswered even in the special case of Q-function approximations that are linear in the parameter. The challenge seems paradoxical, given the long history of convex analytic approaches to dynamic programming. The paper begins with a brief survey of linear programming approaches to optimal control, leading to a particular over-parameterization that lends itself to applications in reinforcement learning. The main conclusions are summarized as follows: (i) A new class of convex Q-learning algorithms is introduced, based on a convex relaxation of the Bellman equation. Convergence is established under general conditions, including a linear function approximation for the Q-function. (ii) A batch implementation appears similar to the famed DQN algorithm (one engine behind AlphaZero). It is shown that in fact the algorithms are very different: while convex Q-learning solves a convex program that approximates the Bellman equation, theory for DQN is no stronger than for Watkins' algorithm with function approximation: (a) it is shown that both seek solutions to the same fixed point equation, and (b) the ODE approximations for the two algorithms coincide, and little is known about the stability of this ODE. These results are obtained for deterministic nonlinear systems with a total cost criterion. Many extensions are proposed, including a kernel implementation and an extension to MDP models.
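To make the idea of a convex relaxation of the Bellman equation concrete, here is a minimal sketch on a toy problem. It is not the paper's algorithm. It assumes a tabular Q-function, a deterministic system with an absorbing zero-cost goal state, and the standard inequality form Q(x,u) <= c(x,u) + Q(F(x,u),u') for all u'. The five-state chain and the names `step`, `cost`, `q_star`, and `bellman_slack` are hypothetical. The sketch does not solve a linear program. It only checks the two facts that make such a program work in this setting: the optimal Q-function is feasible for the relaxed constraints, and raising an entry above it breaks feasibility.

```python
# Toy deterministic total-cost problem (hypothetical, for illustration only):
# states 0..4 on a line, state 0 is an absorbing goal with zero cost.
N_STATES = 5
ACTIONS = (-1, +1)  # step left or step right
GOAL = 0


def step(x, u):
    """Deterministic dynamics F(x, u); the goal state is absorbing."""
    if x == GOAL:
        return GOAL
    return min(max(x + u, 0), N_STATES - 1)


def cost(x, u):
    """Nonnegative one-step cost c(x, u): zero at the goal, one elsewhere."""
    return 0.0 if x == GOAL else 1.0


def q_star(n_sweeps=50):
    """Tabular value iteration on Q(x,u) = c(x,u) + min_u' Q(F(x,u), u')."""
    Q = {(x, u): 0.0 for x in range(N_STATES) for u in ACTIONS}
    for _ in range(n_sweeps):
        Q = {(x, u): cost(x, u) + min(Q[(step(x, u), v)] for v in ACTIONS)
             for (x, u) in Q}
    return Q


def bellman_slack(Q):
    """Smallest value of c(x,u) + Q(F(x,u),u') - Q(x,u) over all (x,u,u').

    A nonnegative result means Q satisfies the relaxed (inequality) constraints.
    """
    return min(cost(x, u) + Q[(step(x, u), v)] - Q[(x, u)]
               for (x, u) in Q for v in ACTIONS)


if __name__ == "__main__":
    Q = q_star()
    # The optimal Q-function is feasible, and tight for the minimizing u'.
    print("slack at Q*:", bellman_slack(Q))

    # Scaling Q* down keeps it feasible, because the costs are nonnegative.
    Q_half = {k: 0.5 * v for k, v in Q.items()}
    print("slack at 0.5 * Q*:", bellman_slack(Q_half))

    # Raising one entry above Q* (goal values still pinned at zero) is infeasible.
    Q_over = dict(Q)
    Q_over[(1, -1)] += 1.0
    print("slack after raising one entry:", bellman_slack(Q_over))
```

Under these assumptions every feasible table lies below the optimal one, so maximizing a positively weighted sum of the entries subject to the linear inequalities would recover it. With a linearly parameterized Q-function the constraints stay linear in the parameter, which is what keeps the relaxed problem convex.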

Stochastic Shortest Path Games and Q-Learning

Stochastic Games with Minimally Bounded Action Costs

Convex Q Learning in a Stochastic Environment: Extended Version

Finite-Time Analysis of Minimax Q-Learning for Two-Player Zero-Sum Markov Games: Switching System Approach

Beyond Strict Competition: Approximate Convergence of Multi Agent Q-Learning Dynamics

Stochastic linear-quadratic differential game with Markovian jumps in an infinite horizon

A tutorial on Zero-sum Stochastic Games

A Multi-Step Minimax Q-learning Algorithm for Two-Player Zero-Sum Markov Games

Discrete-Time LQ Stochastic Two-Person Nonzero-Sum Difference Games with Random Coefficients: Open-Loop Nash Equilibrium

Minimax Q-learning Control for Linear Systems Using the Wasserstein Metric

Convex-Concave Zero-sum Markov Stackelberg Games

Optimal Control of Robust Team Stochastic Games

A zero-sum hybrid stochastic differential game with switching controls

Neural Q-learning for discrete-time nonlinear zero-sum games with adjustable convergence rate

A unified stochastic approximation framework for learning in games

Stochastic differential games with random coefficients and stochastic Hamilton-Jacobi-Bellman-Isaacs equations

Stability of Multi-Agent Learning in Competitive Networks: Delaying the Onset of Chaos

Convex Q-Learning, Part 1: Deterministic Optimal Control

Exploration Analysis in Finite-Horizon Turn-based Stochastic Games

Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

Reinforcement Learning for Multi-Objective and Constrained Markov Decision Processes