Abstract: The first algorithm for the Linear Quadratic (LQ) control problem with an unknown system model, featuring a regret of $\mathcal{O}(\sqrt{T})$, was introduced by Abbasi-Yadkori and Szepesvári (2011). Recognizing the computational complexity of this algorithm, subsequent efforts (see Cohen et al. (2019), Mania et al. (2019), Faradonbeh et al. (2020a), and Kargin et al. (2022)) have been dedicated to proposing algorithms that are computationally tractable while preserving this order of regret. Although successful, the existing works in the literature lack a fully adaptive exploration-exploitation trade-off adjustment and require a user-defined value, which can inflate the overall regret bound by additional factors. In this work, noticing this gap, we propose the first fully adaptive algorithm that controls the number of policy updates (i.e., tunes the exploration-exploitation trade-off) and optimizes the upper bound of regret adaptively. Our proposed algorithm builds on the SDP-based approach of Cohen et al. (2019) and relaxes its need for a horizon-dependent warm-up phase by appropriately tuning the regularization parameter and adding an adaptive input perturbation. We further show that through careful exploration-exploitation trade-off adjustment there is no need to commit to the widely-used notion of strong sequential stability, which is restrictive and can introduce complexities in initialization.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper addresses several challenges in the linear-quadratic (LQ) control problem. Specifically, it asks how to design a fully adaptive algorithm that achieves an \(O(\sqrt{T})\) regret bound when the system model is unknown. The main problems include:
1. **Balance between exploration and exploitation**:
- Existing methods do not adjust the exploration-exploitation trade-off adaptively; they rely on a user-defined value, which can inflate the regret bound. Updating the policy too often may destabilize the system and increase regret, while updating too rarely fails to optimize performance. The paper proposes an adaptive algorithm that controls the number of policy updates and thereby tunes this trade-off to optimize the regret bound.
2. **Computational complexity**:
- The first algorithm with an \(O(\sqrt{T})\) regret bound (Abbasi-Yadkori and Szepesvári, 2011) was computationally expensive. The paper builds on the computationally tractable SDP-based approach of Cohen et al. (2019), so it keeps the \(O(\sqrt{T})\) regret bound without returning to that cost.
3. **Dependence on the initialization phase**:
- The SDP-based approach of Cohen et al. (2019) requires a warm-up phase whose length depends on the time horizon, which limits the adaptability of the algorithm. By tuning the regularization parameter and adding an adaptive input perturbation, the paper relaxes the need for this horizon-dependent warm-up phase, so the algorithm can run without knowing the time horizon in advance.
4. **Strong sequential stability**:
- Many existing methods rely on the notion of strong sequential stability, which is restrictive and can complicate initialization. The paper shows that, with a careful exploration-exploitation trade-off adjustment, there is no need to commit to strong sequential stability, which simplifies the design and analysis of the algorithm.
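For reference, the setting behind these four problems is usually formalized as below. The notation (\(A\), \(B\), \(Q\), \(R\), \(w_t\), \(J^*\)) is assumed here from the standard LQ literature and does not appear in the summary itself, so treat it as a sketch of the problem rather than the paper's exact statement.

\[
x_{t+1} = A x_t + B u_t + w_t, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t,
\]

\[
\mathrm{Regret}(T) = \sum_{t=1}^{T} \left( c_t - J^* \right),
\]

where \(x_t\) is the state, \(u_t\) the control input, \(w_t\) the process noise, \((A, B)\) the unknown system matrices, and \(J^*\) the optimal average cost attainable by a controller that knows \((A, B)\). An \(O(\sqrt{T})\) regret bound then says that the cumulative excess cost of the learning controller grows no faster than a constant times \(\sqrt{T}\).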
### Main contributions
1. **Adaptive balance between exploration and exploitation**:
- A new adaptive algorithm is proposed. By selecting an appropriate criterion \(\det(V_t)\geq(1 + \beta_\tau)\det(V_\tau)\) and a monotonically decreasing function \(\beta_\tau\), the frequency of policy updates is optimized, thereby improving the regret bound.
2. **Fully adaptive convex SDP method**:
- Building on the OSLO algorithm proposed by Cohen et al. (2019), a fully adaptive convex SDP method is proposed that relaxes OSLO's need for a horizon-dependent warm-up phase. This method achieves an \(O(\sqrt{T})\) regret bound without the need for a predefined time horizon.
3. **Elimination of strong sequential stability**:
- By adjusting the balance between exploration and exploitation, the paper shows that there is no need to rely on strong sequential stability, which simplifies the design and analysis of the algorithm. The proposed algorithm still ensures that the state norm remains bounded even when the initial estimates are imprecise.
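The determinant-based update criterion from the first contribution can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's algorithm: \(V_t\) is assumed to be a regularized Gram matrix \(\lambda I + \sum_s z_s z_s^\top\) of state-input vectors \(z_s\), the data are random stand-ins for a real trajectory, and the schedule \(\beta_\tau = 1/\sqrt{1+\tau}\) is a hypothetical choice of monotonically decreasing function. The function name `policy_update_times` is likewise made up for this example.

```python
import numpy as np

def policy_update_times(zs, lam=1.0, beta=lambda tau: 1.0):
    """Return the times t at which det(V_t) >= (1 + beta(tau)) * det(V_tau),
    where tau is the time of the most recent update (tau = 0 initially)."""
    d = zs.shape[1]
    V = lam * np.eye(d)  # assumed form: V_0 = lam * I
    # Compare log-determinants instead of determinants for numerical stability.
    _, logdet_last = np.linalg.slogdet(V)
    tau = 0
    update_times = []
    for t, z in enumerate(zs, start=1):
        V += np.outer(z, z)  # rank-one update with the new observation
        _, logdet = np.linalg.slogdet(V)
        if logdet >= np.log1p(beta(tau)) + logdet_last:
            update_times.append(t)  # a new policy would be computed here
            tau = t
            logdet_last = logdet
    return update_times

# Random stand-in data: 500 observations of a 3-dimensional vector z_t.
rng = np.random.default_rng(0)
zs = rng.standard_normal((500, 3))

# Fixed threshold (determinant must double) versus a hypothetical decreasing schedule.
updates_fixed = policy_update_times(zs, beta=lambda tau: 1.0)
updates_adaptive = policy_update_times(zs, beta=lambda tau: 1.0 / np.sqrt(1.0 + tau))

print(len(updates_fixed), len(updates_adaptive))
```

With the fixed threshold \(\beta = 1\), the determinant must at least double between updates, so the number of updates is at most \(\log_2\!\big(\det(V_T)/\det(V_0)\big)\). A decreasing schedule that never exceeds 1 triggers updates at least as often, which is the kind of trade-off the criterion is meant to tune.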
### Conclusion
By proposing a new adaptive algorithm, the paper addresses the shortcomings of existing methods in terms of the balance between exploration and exploitation, dependence on a horizon-dependent warm-up phase, and reliance on strong sequential stability. The algorithm remains computationally tractable and gains adaptability while maintaining the \(O(\sqrt{T})\) regret bound.