Abstract: Direct data-driven design methods for the linear quadratic regulator (LQR) mainly use offline or episodic data batches, and their online adaptation has been acknowledged as an open problem. In this paper, we propose a direct adaptive method to learn the LQR from online closed-loop data. First, we propose a new policy parameterization based on the sample covariance to formulate a direct data-driven LQR problem, which is shown to be equivalent to the certainty-equivalence LQR with optimal non-asymptotic guarantees. Second, we design a novel data-enabled policy optimization (DeePO) method to directly update the policy, where the gradient is explicitly computed using only a batch of persistently exciting (PE) data. Third, we establish its global convergence via a projected gradient dominance property. Importantly, we efficiently use DeePO to adaptively learn the LQR by performing only one-step projected gradient descent per sample of the closed-loop system, which also leads to an explicit recursive update of the policy. Under PE inputs and for bounded noise, we show that the average regret of the LQR cost is upper-bounded by two terms signifying a sublinear decrease in time $\mathcal{O}(1/\sqrt{T})$ plus a bias scaling inversely with signal-to-noise ratio (SNR), which are independent of the noise statistics. Finally, we perform simulations to validate the theoretical results and demonstrate the computational and sample efficiency of our method.
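
The regret statement in the abstract can be sketched in symbols as follows. The notation is assumed here for illustration and is not taken from the abstract: $K_t$ denotes the policy at time $t$, $C(\cdot)$ the LQR cost, and $C^*$ the optimal cost.

$$
\frac{1}{T}\sum_{t=1}^{T}\bigl(C(K_t) - C^*\bigr) \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right) + \mathcal{O}\!\left(\frac{1}{\mathrm{SNR}}\right)
$$

The first term vanishes as more closed-loop samples arrive, while the second is a bias that shrinks only when the signal-to-noise ratio grows.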

What problem does this paper attempt to address?

This paper addresses online adaptive learning of the linear quadratic regulator (LQR) with a direct data-driven method. Existing direct data-driven design methods mainly use offline or episodic data batches, and their online adaptation has remained an open problem. The paper proposes a direct adaptive method that learns the LQR from online closed-loop data, with the following key contributions:

1. **Direct data-driven LQR problem**:
   - A new policy parameterization based on the sample covariance is proposed to formulate the direct data-driven LQR problem.
   - This formulation is shown to be equivalent to the certainty-equivalence LQR with optimal non-asymptotic guarantees.
2. **Data-enabled policy optimization (DeePO)**:
   - A novel data-enabled policy optimization (DeePO) method is designed, which computes the gradient explicitly from a batch of persistently exciting (PE) data and uses it to update the policy directly.
   - Its global convergence is established via a projected gradient dominance property.
3. **Efficient online adaptive learning**:
   - DeePO is applied adaptively by performing only one step of projected gradient descent per sample of the closed-loop system, which yields an explicit recursive update of the policy.
   - Under persistently exciting inputs and bounded noise, the average regret of the LQR cost is upper-bounded by a term that decreases as $\mathcal{O}(1/\sqrt{T})$ plus a bias that scales inversely with the signal-to-noise ratio (SNR).
4. **Non-asymptotic performance guarantees**:
   - The non-asymptotic guarantees for DeePO in adaptive learning of the LQR are independent of the noise statistics.
   - The sublinear regret rate of DeePO is contrasted with single-batch methods to indicate its higher sample efficiency.
5. **Simulation verification**:
   - Simulations validate the theoretical results and demonstrate the computational and sample efficiency of DeePO.

In conclusion, this paper aims to close the gap in direct data-driven methods for online adaptive learning of the LQR and provides an efficient method with theoretical guarantees.
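
The ideas summarized above (data-based sufficient statistics, an explicit gradient, and one gradient step per closed-loop sample) can be illustrated on a scalar system. The Python sketch below is not the paper's DeePO algorithm: it replaces the covariance-parameterized projected gradient with a plain gradient step on the feedback gain of the least-squares (certainty-equivalence) model, and a simple stability check stands in for the projection. The system parameters, step size, warm-up length, noise level, and all function names are illustrative assumptions.

```python
import random


def lqr_cost(k, a, b, q=1.0, r=1.0):
    """Cost of u = k*x on x+ = a*x + b*u, normalized to unit initial-state variance."""
    c = a + b * k                       # closed-loop pole
    if abs(c) >= 1.0:
        return float("inf")             # gain is not stabilizing
    return (q + r * k * k) / (1.0 - c * c)


def lqr_grad(k, a, b, q=1.0, r=1.0):
    """Derivative of lqr_cost with respect to k (requires |a + b*k| < 1)."""
    c = a + b * k
    sigma = 1.0 / (1.0 - c * c)         # closed-loop state variance
    p = (q + r * k * k) * sigma         # value-function coefficient
    return 2.0 * (r * k + b * p * c) * sigma


def optimal_gain(a, b, q=1.0, r=1.0, iters=500):
    """Scalar Riccati fixed-point iteration, used only as a reference."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return -a * b * p / (r + b * b * p)


def adaptive_lqr(a=0.9, b=0.5, k0=-0.2, T=300, warmup=20,
                 eta=0.05, noise=0.01, seed=0):
    """One gradient step per closed-loop sample (hypothetical scalar setup)."""
    rng = random.Random(seed)
    k, x = k0, 0.0
    # Running sums for the covariance of [u; x] and its cross terms with x+.
    # Dividing by t gives sample covariances; the factor cancels in the estimate.
    s_uu = s_ux = s_xx = s_yu = s_yx = 0.0
    k_star = optimal_gain(a, b)
    c_star = lqr_cost(k_star, a, b)
    regret = 0.0
    for t in range(1, T + 1):
        regret += lqr_cost(k, a, b) - c_star           # evaluation only
        u = k * x + rng.uniform(-1.0, 1.0)              # probing keeps data PE
        y = a * x + b * u + rng.uniform(-noise, noise)  # bounded process noise
        s_uu += u * u
        s_ux += u * x
        s_xx += x * x
        s_yu += y * u
        s_yx += y * x
        x = y
        det = s_uu * s_xx - s_ux * s_ux
        if t < warmup or det < 1e-9:
            continue                                    # not enough excitation yet
        # Least-squares estimate [b_hat, a_hat] via the 2x2 inverse.
        b_hat = (s_yu * s_xx - s_yx * s_ux) / det
        a_hat = (s_yx * s_uu - s_yu * s_ux) / det
        if abs(a_hat + b_hat * k) >= 1.0:
            continue                                    # gradient undefined here
        k_new = k - eta * lqr_grad(k, a_hat, b_hat)
        if abs(a_hat + b_hat * k_new) < 1.0:            # keep estimated stability
            k = k_new
    return k, k_star, regret / T
```

Calling `adaptive_lqr()` returns the final gain, the reference gain from the Riccati iteration, and the average regret. The true `a` and `b` are used only to simulate the plant and to evaluate the regret; the gain update itself sees only the data-based estimates.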

Data-Enabled Policy Optimization for Direct Adaptive Learning of the LQR

Data-enabled Policy Optimization for the Linear Quadratic Regulator

Linear Convergence of Data-Enabled Policy Optimization for Linear Quadratic Tracking

Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization

Stability-Certified On-Policy Data-Driven LQR via Recursive Learning and Policy Gradient

Data-Driven LQR using Reinforcement Learning and Quadratic Neural Networks

Data-Driven LQR with Finite-Time Experiments via Extremum-Seeking Policy Iteration

Direct Adaptive Control of Grid-Connected Power Converters via Output-Feedback Data-Enabled Policy Optimization

Direct Data-Driven Discounted Infinite Horizon Linear Quadratic Regulator with Robustness Guarantees

Global Convergence of Policy Gradient Primal-dual Methods for Risk-constrained LQRs

Policy Gradient Methods for the Cost-Constrained LQR: Strong Duality and Global Convergence

Model-Free Design of Stochastic LQR Controller from Reinforcement Learning and Primal-Dual Optimization Perspective

An Efficient Off-Policy Reinforcement Learning Algorithm for the Continuous-Time LQR Problem

Accelerated Optimization Landscape of Linear-Quadratic Regulator

On the Certainty-Equivalence Approach to Direct Data-Driven LQR Design

Asynchronous Parallel Policy Gradient Methods for the Linear Quadratic Regulator

Fast Policy Learning for Linear Quadratic Control with Entropy Regularization

Sample Complexity of the Linear Quadratic Regulator: A Reinforcement Learning Lens

On the Optimization Landscape of Dynamic Output Feedback: A Case Study for Linear Quadratic Regulator

Asynchronous Heterogeneous Linear Quadratic Regulator Design