Abstract: The first algorithm for the Linear Quadratic (LQ) control problem with an unknown system model, featuring a regret of $\mathcal{O}(\sqrt{T})$, was introduced by Abbasi-Yadkori and Szepesvári (2011). Recognizing the computational complexity of this algorithm, subsequent efforts (see Cohen et al. (2019), Mania et al. (2019), Faradonbeh et al. (2020a), and Kargin et al. (2022)) have been dedicated to proposing algorithms that are computationally tractable while preserving this order of regret. Although successful, the existing works in the literature lack a fully adaptive exploration-exploitation trade-off adjustment and require a user-defined value, which can inflate the overall regret bound by some factors. In this work, noticing this gap, we propose the first fully adaptive algorithm that controls the number of policy updates (i.e., tunes the exploration-exploitation trade-off) and optimizes the upper bound on the regret adaptively. Our proposed algorithm builds on the SDP-based approach of Cohen et al. (2019) and relaxes its need for a horizon-dependent warm-up phase by appropriately tuning the regularization parameter and adding an adaptive input perturbation. We further show that through careful exploration-exploitation trade-off adjustment there is no need to commit to the widely used notion of strong sequential stability, which is restrictive and can introduce complexities in initialization.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve several key challenges in the linear-quadratic (LQ) control problem. Specifically, it focuses on how to design a fully adaptive algorithm that achieves the optimal regret bound $O(\sqrt{T})$ when the system model is unknown. The main problems include:

1. **Balance between exploration and exploitation**:
   - Existing methods do not balance exploration and exploitation well. Frequent policy updates may lead to system instability and increase the regret bound. Conversely, too low an update frequency fails to optimize performance effectively. The paper proposes a new adaptive algorithm that dynamically adjusts this balance, thereby optimizing the regret bound.
2. **Computational complexity**:
   - Although earlier methods achieved a regret bound of $O(\sqrt{T})$, their computational complexity was high. By introducing adaptive input perturbations and adjusting the regularization parameter, the paper proposes a computationally more efficient algorithm while maintaining the $O(\sqrt{T})$ regret bound.
3. **Dependence on the initialization phase**:
   - Existing methods usually require an initialization phase that depends on the time horizon, which limits the adaptability of the algorithm. By adjusting the regularization parameter and input perturbation, the paper eliminates this dependence, enabling the algorithm to run without knowing the time horizon.
4. **Strong sequential stability**:
   - Some existing methods rely on the concept of strong sequential stability, which may be too strict in practical applications. By adjusting the balance between exploration and exploitation, the paper proves that there is no need to rely on strong sequential stability, thus simplifying the design and analysis of the algorithm.

### Main contributions

1. **Adaptive balance between exploration and exploitation**:
   - A new adaptive algorithm is proposed. By selecting an appropriate criterion $\det(V_t)\geq(1 + \beta_\tau)\det(V_\tau)$ and a monotonically decreasing function $\beta_\tau$, the frequency of policy updates is optimized, thereby improving the regret bound.
2. **Fully adaptive convex SDP method**:
   - Based on the OSLO algorithm proposed by Cohen et al. (2019), a fully adaptive convex SDP method is proposed by relaxing its constraints. This method achieves an $O(\sqrt{T})$ regret bound without the need for a predefined time horizon.
3. **Elimination of strong sequential stability**:
   - By adjusting the balance between exploration and exploitation, it is proved that there is no need to rely on strong sequential stability, thus simplifying the design and analysis of the algorithm. The proposed algorithm can still ensure the boundedness of the state norm even when the initial estimates are imprecise.

### Conclusion

By proposing a new adaptive algorithm, the paper addresses the shortcomings of existing methods in terms of the balance between exploration and exploitation, computational complexity, dependence on the initialization phase, and strong sequential stability. The algorithm improves computational efficiency and adaptability while maintaining the $O(\sqrt{T})$ regret bound.
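
The update criterion $\det(V_t)\geq(1 + \beta_\tau)\det(V_\tau)$ listed under the main contributions can be illustrated with a minimal sketch. It is not the paper's algorithm: the schedule `beta`, the helper names `should_update` and `count_updates`, and the random vectors standing in for the state-input regressors are all assumptions made for illustration, and the determinants are compared in log space purely for numerical stability.

```python
import numpy as np


def beta(tau: int) -> float:
    """Hypothetical monotonically decreasing schedule (assumption; the
    paper's actual beta_tau is not specified in this summary)."""
    return 1.0 / np.sqrt(1.0 + tau)


def should_update(V_t: np.ndarray, V_tau: np.ndarray, tau: int) -> bool:
    """Check det(V_t) >= (1 + beta_tau) * det(V_tau) using log-determinants."""
    _, logdet_t = np.linalg.slogdet(V_t)
    _, logdet_tau = np.linalg.slogdet(V_tau)
    return bool(logdet_t >= np.log1p(beta(tau)) + logdet_tau)


def count_updates(T: int, dim: int = 3, lam: float = 1.0, seed: int = 0) -> list:
    """Accumulate V_t = lam * I + sum_s z_s z_s^T and record the update times."""
    rng = np.random.default_rng(seed)
    V = lam * np.eye(dim)          # regularized covariance matrix
    V_tau, tau = V.copy(), 0       # snapshot taken at the last policy update
    update_times = []
    for t in range(1, T + 1):
        z = rng.standard_normal(dim)   # stand-in for the regressor [x_t; u_t]
        V += np.outer(z, z)
        if should_update(V, V_tau, tau):
            V_tau, tau = V.copy(), t   # a new policy would be computed here
            update_times.append(t)
    return update_times


if __name__ == "__main__":
    times = count_updates(1000)
    print(f"{len(times)} policy updates in 1000 steps")
```

Under this assumed schedule, a smaller `beta` lowers the determinant growth required before the next update, so the choice of schedule directly controls how often the policy is recomputed.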

Fully Adaptive Regret-Guaranteed Algorithm for Control of Linear Quadratic Systems

Adaptive Backstepping Control for a Class of Nonlinear Systems with Non-Triangular Structural Uncertainties

Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

On Adaptive Linear-Quadratic Regulators

Almost Surely $\sqrt{T}$ Regret Bound for Adaptive LQR

An Iterative Riccati Algorithm for Online Linear Quadratic Control

Learning Decentralized Linear Quadratic Regulators with $\sqrt{T}$ Regret

Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems

Nonasymptotic Regret Analysis of Adaptive Linear Quadratic Control with Model Misspecification

Regret Analysis of Learning-Based Linear Quadratic Gaussian Control with Additive Exploration

Exponentially Stable Adaptive Optimal Control of Uncertain LTI Systems

Online Linear Quadratic Tracking with Regret Guarantees

Learn and Control while Switching: with Guaranteed Stability and Sublinear Regret

Learning to Control under Time-Varying Environment

Controlling Unknown Linear Dynamics with Almost Optimal Regret

Regret Bounds for Episodic Risk-Sensitive Linear Quadratic Regulator

Robust Adaptive Iterative Learning Control for Discrete‐time Nonlinear Systems with Both Parametric and Nonparametric Uncertainties

Regret Optimal Control for Uncertain Stochastic Systems

Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems

Optimal Adaptive Control of Linear Stochastic Systems with Quadratic Cost Function