Data-Enabled Policy Optimization for Direct Adaptive Learning of the LQR

Feiran Zhao, Florian Dörfler, Alessandro Chiuso, Keyou You
2024-10-04
Abstract: Direct data-driven design methods for the linear quadratic regulator (LQR) mainly use offline or episodic data batches, and their online adaptation has been acknowledged as an open problem. In this paper, we propose a direct adaptive method to learn the LQR from online closed-loop data. First, we propose a new policy parameterization based on the sample covariance to formulate a direct data-driven LQR problem, which is shown to be equivalent to the certainty-equivalence LQR with optimal non-asymptotic guarantees. Second, we design a novel data-enabled policy optimization (DeePO) method to directly update the policy, where the gradient is explicitly computed using only a batch of persistently exciting (PE) data. Third, we establish its global convergence via a projected gradient dominance property. Importantly, we efficiently use DeePO to adaptively learn the LQR by performing only one-step projected gradient descent per sample of the closed-loop system, which also leads to an explicit recursive update of the policy. Under PE inputs and for bounded noise, we show that the average regret of the LQR cost is upper-bounded by two terms signifying a sublinear decrease in time $\mathcal{O}(1/\sqrt{T})$ plus a bias scaling inversely with signal-to-noise ratio (SNR), which are independent of the noise statistics. Finally, we perform simulations to validate the theoretical results and demonstrate the computational and sample efficiency of our method.
Optimization and Control, Systems and Control
What problem does this paper attempt to address?
This paper addresses online adaptive learning of the linear quadratic regulator (LQR) with direct data-driven methods. Existing direct data-driven designs mainly use offline or episodic data batches, and their online adaptation has remained an open problem. The paper proposes a direct adaptive method that learns the LQR from online closed-loop data, with the following contributions:

1. **Direct data-driven LQR problem**:
   - A new policy parameterization based on the sample covariance is proposed to formulate the direct data-driven LQR problem.
   - This formulation is shown to be equivalent to the certainty-equivalence LQR with optimal non-asymptotic guarantees.
2. **Data-enabled policy optimization (DeePO)**:
   - A novel data-enabled policy optimization (DeePO) method updates the policy directly, with the gradient computed explicitly from a single batch of persistently exciting (PE) data.
   - Its global convergence is established via a projected gradient dominance property.
3. **Efficient online adaptive learning**:
   - DeePO learns the LQR adaptively by performing only one step of projected gradient descent per sample of the closed-loop system, which yields an explicit recursive update of the policy.
   - Under PE inputs and bounded noise, the average regret of the LQR cost is upper-bounded by a term that decreases sublinearly in time, \(\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)\), plus a bias that scales inversely with the signal-to-noise ratio (SNR).
4. **Non-asymptotic performance guarantees**:
   - The guarantees for adaptive learning of the LQR are non-asymptotic and independent of the noise statistics.
   - The time-dependent term of the regret bound decays sublinearly, at rate \(\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)\).
5. **Simulation verification**:
   - Simulations validate the theoretical results and demonstrate the computational and sample efficiency of DeePO.

In conclusion, the paper fills a gap in direct data-driven methods for online adaptive learning of the LQR and provides an efficient method with theoretical guarantees.
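
The covariance parameterization and the projected gradient step summarized above can be illustrated with a small numerical sketch. It is not the authors' implementation: the example system `A`, `B`, the weights `Q`, `R`, the step size `eta`, the iteration count, and all function and variable names are assumptions chosen for illustration. The formulas assume the policy is written as `[K; I] = Phi @ V`, where `Phi` is the sample covariance of the stacked input and state data. For simplicity the sketch uses one fixed, noise-free offline batch and repeats the projected gradient step on it, rather than taking one step per new closed-loop sample.

```python
import numpy as np

def dlyap(A, Q):
    """Solve X = A X A^T + Q by vectorization (adequate for small matrices)."""
    n = A.shape[0]
    x = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1))
    return x.reshape(n, n)

def cost_and_gradient(V, U0b, X1b, Q, R):
    """LQR cost J(V) and its gradient under the assumed covariance parameterization."""
    K, Acl = U0b @ V, X1b @ V                  # gain and closed-loop matrix
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:
        raise ValueError("policy is not stabilizing")
    P = dlyap(Acl.T, Q + K.T @ R @ K)          # P = Acl' P Acl + Q + K' R K
    Sigma = dlyap(Acl, np.eye(Acl.shape[0]))   # Sigma = Acl Sigma Acl' + I
    grad = 2.0 * (U0b.T @ R @ K + X1b.T @ P @ Acl) @ Sigma
    return np.trace(P), grad

# Hypothetical stable example system (not taken from the paper).
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.eye(2)
n, m, T = 2, 2, 50
Q, R = np.eye(n), np.eye(m)

# One batch of noise-free data driven by random (persistently exciting) inputs.
U0 = rng.standard_normal((m, T))
X = np.zeros((n, T + 1))
for k in range(T):
    X[:, k + 1] = A @ X[:, k] + B @ U0[:, k]
X0, X1 = X[:, :-1], X[:, 1:]

# Sample-covariance quantities built from the stacked data D0 = [U0; X0].
D0 = np.vstack([U0, X0])
Phi = D0 @ D0.T / T
U0b, X0b, X1b = U0 @ D0.T / T, X0 @ D0.T / T, X1 @ D0.T / T

# Initial policy K = 0 (stabilizing here because A is stable), mapped to V.
V = np.linalg.solve(Phi, np.vstack([np.zeros((m, n)), np.eye(n)]))
# Projection onto the null space of X0b keeps the constraint X0b @ V = I.
Proj = np.eye(m + n) - np.linalg.pinv(X0b) @ X0b

eta = 0.01  # assumed step size
J_init, _ = cost_and_gradient(V, U0b, X1b, Q, R)
for _ in range(500):
    _, grad = cost_and_gradient(V, U0b, X1b, Q, R)
    V = V - eta * Proj @ grad
J_final, _ = cost_and_gradient(V, U0b, X1b, Q, R)
K_final = U0b @ V

# Model-based reference: Riccati iteration on the true (A, B).
P = Q.copy()
for _ in range(500):
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ G)
J_opt = np.trace(P)
print(J_init, J_final, J_opt)
```

Because the data in this sketch are noise-free, the data-based cost coincides with the model-based LQR cost, so `J_final` can be compared directly with `J_opt`. In an online variant along the lines of the summary, the covariance quantities would be refreshed with each new sample and a single projected gradient step taken per update.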