Abstract: To improve the efficiency of reinforcement learning (RL), we propose a novel asynchronous federated reinforcement learning (FedRL) framework, termed AFedPG, in which $N$ agents collaboratively construct a global model using policy gradient (PG) updates. To address the challenge of lagged policies in asynchronous settings, we design a delay-adaptive lookahead technique specifically for FedRL that effectively handles heterogeneous arrival times of policy gradients. We derive a global convergence bound for AFedPG and characterize the advantages of the proposed algorithm in terms of both sample complexity and time complexity. Specifically, AFedPG achieves $O(\epsilon^{-2.5}/N)$ sample complexity for global convergence at each agent on average. Compared to the single-agent setting with $O(\epsilon^{-2.5})$ sample complexity, this is a linear speedup with respect to the number of agents. Moreover, compared to synchronous FedPG, AFedPG improves the time complexity from $O(t_{\max}/N)$ to $O\big((\sum_{i=1}^{N} 1/t_{i})^{-1}\big)$, where $t_{i}$ denotes the time consumed per iteration at agent $i$, $t_{\max} = \max_i t_i$, and $t_{\min} = \min_i t_i$. Because $(\sum_{i=1}^{N} 1/t_{i})^{-1}$ equals the harmonic mean of the $t_i$ divided by $N$, it never exceeds $t_{\max}/N$ (with equality only when all $t_i$ are equal), and the improvement becomes significant in large-scale federated settings with heterogeneous computing power ($t_{\max}\gg t_{\min}$). Finally, we empirically verify the improved performance of AFedPG in four widely used MuJoCo environments with varying numbers of agents, and we demonstrate its advantages under various levels of computing heterogeneity.
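The time-complexity comparison can be checked with a small sketch. It assumes that agent $i$ needs exactly $t_i$ time units per iteration and that communication is instantaneous. A synchronous round therefore waits $t_{\max}$ to collect $N$ gradients, while in the asynchronous setting agent $i$ delivers a gradient every $t_i$ units without waiting for the others. The function names are hypothetical. The simulation only counts gradient arrivals at the server and does not implement AFedPG's delay-adaptive lookahead update.

```python
import heapq
import math


def sync_time_per_gradient(times):
    """Synchronous FedPG (assumed model): each round waits for the slowest
    agent, so N gradients arrive every max(t_i) time units."""
    return max(times) / len(times)


def async_time_per_gradient(times):
    """Asynchronous setting (assumed model): agent i delivers one gradient
    every t_i units, so the server receives sum_i 1/t_i gradients per unit time."""
    return 1.0 / sum(1.0 / t for t in times)


def simulate_async_arrivals(times, horizon):
    """Event-driven count of gradients that reach the server by `horizon`
    when every agent works independently and never waits for the others."""
    heap = [(t, i) for i, t in enumerate(times)]  # (next finish time, agent)
    heapq.heapify(heap)
    arrivals = 0
    while heap[0][0] <= horizon:
        finish, i = heapq.heappop(heap)
        arrivals += 1  # the server would apply a (possibly stale) update here
        heapq.heappush(heap, (finish + times[i], i))
    return arrivals


def simulate_sync_arrivals(times, horizon):
    """Gradients collected by `horizon` when each round lasts max(t_i)."""
    rounds = int(horizon // max(times))
    return rounds * len(times)


if __name__ == "__main__":
    times = [1, 1, 1, 10]  # three fast agents, one straggler (t_max >> t_min)
    print(sync_time_per_gradient(times))         # 2.5
    print(async_time_per_gradient(times))        # about 0.3226
    print(simulate_sync_arrivals(times, 100))    # 40
    print(simulate_async_arrivals(times, 100))   # 310
    # With identical agents the two costs coincide.
    print(math.isclose(sync_time_per_gradient([2, 2, 2]),
                       async_time_per_gradient([2, 2, 2])))  # True
```

In this toy setting, one straggler makes every synchronous round last 10 units. The asynchronous server instead keeps receiving gradients from the fast agents, and its arrival count over the horizon matches $\text{horizon} \cdot \sum_i 1/t_i$.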

Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis
