Reinforcement Learning for a Discrete-Time Linear-Quadratic Control Problem with an Application

Lucky Li
2024-12-08
Abstract: We study the discrete-time linear-quadratic (LQ) control model using reinforcement learning (RL). Using entropy to measure the cost of exploration, we prove that the optimal feedback policy for the problem must be of Gaussian type. We then apply the results of the discrete-time LQ model to solve the discrete-time mean-variance asset-liability management problem and prove the policy improvement and convergence of our RL algorithm. Finally, a numerical example based on simulations illustrates the theoretical results.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper solves the discrete-time linear-quadratic (LQ) control problem using reinforcement learning (RL). Its main objectives are:

1. **Study of the discrete-time LQ control model**:
   - Using entropy to measure the cost of exploration, prove that the optimal feedback policy must be of Gaussian type.
   - Derive the value function under a given policy and establish its form and properties.
2. **Application to finance**:
   - Apply the results of the discrete-time LQ model to solve the discrete-time mean-variance asset-liability management problem.
   - Prove the policy improvement and convergence of the RL algorithm, ensuring that the algorithm finds the optimal solution.
3. **Numerical experiment to verify the theoretical results**:
   - Illustrate the theoretical results through numerical simulation and examine the performance of the proposed RL algorithm in a simulated market environment.
4. **Broader real-world problems**:
   - Explore how to extend the RL method to nonlinear systems in order to handle more complex real-world problems, such as biological data analysis, self-driving, and robot control.

### Formula summary

The key formulas of the paper and their explanations are as follows:

1. **Objective function of the discrete-time LQ problem**: the controller solves $\min_\pi V_\pi(0, x_0, y_0)$, where

   \[ V_\pi(0, x_0, y_0) = \mathbb{E} \left[ \begin{pmatrix} x_T \\ y_T \end{pmatrix}^\top Q_T \begin{pmatrix} x_T \\ y_T \end{pmatrix} + \lambda \sum_{t = 0}^{T - 1} \int_{\mathbb{R}^m} \pi_t(u) \ln \pi_t(u) \, du \right] \]

   Here $\pi_t$ is the density of the $m$-dimensional control $u$ at time $t$, and $\lambda$ is the temperature parameter, which weights the trade-off between exploration and exploitation. The integral is the negative entropy of $\pi_t$, so a more exploratory (higher-entropy) policy lowers this term.

2. **Optimal value function**:

   \[ J^*(t, x_t, y_t) = \begin{pmatrix} x_t \\ y_t \end{pmatrix}^\top P_t \begin{pmatrix} x_t \\ y_t \end{pmatrix} + \frac{\lambda}{2} \sum_{k = t}^{T - 1} \ln \left( \left( \frac{1}{\pi \lambda} \right)^m |G_k| \right) \]

   where $P_t = F_t - H_t G^{-1}_t H_t^\top$. Inside the logarithm, $\pi$ denotes the constant $3.14159\ldots$, not the policy.

3. **Density function of the optimal feedback control**:

   \[ \pi^*_t(u) = \mathcal{N} \left( -G^{-1}_t H_t^\top \begin{pmatrix} x_t \\ y_t \end{pmatrix}, \frac{\lambda}{2} G^{-1}_t \right) \]

   The mean is linear in the state, and the covariance $\frac{\lambda}{2} G^{-1}_t$ shrinks to zero as $\lambda \to 0$, so exploration vanishes when it carries no reward.

These formulas show how to solve the discrete-time LQ control problem through the RL method and how to apply it to the mean-variance asset-liability management problem in finance.
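
The Gaussian policy and the logarithmic term in the optimal value function can be checked numerically for a single time step. The sketch below is not the paper's algorithm. It assumes that the one-step cost-to-go is the quadratic form $u^\top G u + 2 u^\top H^\top z + z^\top F z$, with $z$ the stacked state $(x_t, y_t)^\top$ and $G$ symmetric positive definite. This form is an assumption chosen to be consistent with $P_t = F_t - H_t G_t^{-1} H_t^\top$; the function names and the numerical values are hypothetical.

```python
import numpy as np


def optimal_policy(G, H, z, lam):
    """Mean and covariance of N(-G^{-1} H^T z, (lam/2) G^{-1})."""
    mean = -np.linalg.solve(G, H.T @ z)
    cov = 0.5 * lam * np.linalg.inv(G)
    return mean, cov


def closed_form_value(F, G, H, z, lam):
    """z^T P z + (lam/2) ln((1/(pi*lam))^m |G|), with P = F - H G^{-1} H^T."""
    m = G.shape[0]
    P = F - H @ np.linalg.solve(G, H.T)
    _, logdet_G = np.linalg.slogdet(G)
    return z @ P @ z + 0.5 * lam * (logdet_G - m * np.log(np.pi * lam))


def regularized_cost(F, G, H, z, lam, mean, cov):
    """Expected quadratic cost plus lam * E[ln pi(u)] for a policy N(mean, cov)."""
    m = G.shape[0]
    # E[u^T G u + 2 u^T H^T z + z^T F z] for u ~ N(mean, cov) (assumed cost form).
    expected_quadratic = (
        mean @ G @ mean + np.trace(G @ cov) + 2.0 * mean @ H.T @ z + z @ F @ z
    )
    _, logdet_cov = np.linalg.slogdet(cov)
    # E[ln pi(u)] is minus the differential entropy of N(mean, cov).
    expected_log_density = -0.5 * (m * np.log(2.0 * np.pi * np.e) + logdet_cov)
    return expected_quadratic + lam * expected_log_density


# Hypothetical illustrative numbers (not taken from the paper).
G = np.array([[2.0, 0.5], [0.5, 1.0]])    # m x m, symmetric positive definite
H = np.array([[1.0, 0.0], [0.5, -1.0]])   # n x m
F = np.array([[3.0, 0.2], [0.2, 2.0]])    # n x n
z = np.array([1.0, -2.0])                 # stacked state (x_t, y_t)
lam = 0.5                                 # temperature parameter

mean, cov = optimal_policy(G, H, z, lam)
print("policy mean:", mean)
print("cost at optimal policy:", regularized_cost(F, G, H, z, lam, mean, cov))
print("closed-form value:     ", closed_form_value(F, G, H, z, lam))
```

Under these assumptions the two printed costs should agree up to floating-point error, and any other Gaussian policy (a shifted mean or a rescaled covariance) should give a strictly larger regularized cost, which is what the Gaussian optimality result asserts for this one-step problem.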