Abstract: We study the discrete-time linear-quadratic (LQ) control model using reinforcement learning (RL). Using entropy to measure the cost of exploration, we prove that the optimal feedback policy for the problem must be of Gaussian type. Then, we apply the results of the discrete-time LQ model to solve the discrete-time mean-variance asset-liability management problem and prove the policy improvement and convergence of our RL algorithm. Finally, a numerical example illustrates the theoretical results through simulations.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve the discrete-time linear-quadratic (LQ) control problem with a reinforcement learning (RL) method. Specifically, its main objectives are:

1. **Study of the discrete-time LQ control model**:
   - Using entropy as a measure of exploration cost, prove that the optimal feedback policy must be of Gaussian type.
   - Derive the value function under a given policy and prove its form and properties.
2. **Application to finance**:
   - Apply the results of the discrete-time LQ model to solve the discrete-time mean-variance asset-liability management problem.
   - Prove the policy improvement and convergence of the RL algorithm, so that the algorithm is guaranteed to approach the optimal solution.
3. **Numerical experiment to verify the theoretical results**:
   - Illustrate the theoretical results with a numerical simulation and examine the performance of the proposed RL algorithm in a simulated market setting.
4. **Broader real-world problems**:
   - Consider how the RL method could be extended to nonlinear systems in order to handle more complex real-world problems, such as bio-data analysis, self-driving, and robot control.

### Formula summary

To better understand the technical details of the paper, here are some key formulas and their explanations:

1. **Objective function of the discrete-time LQ problem**:
   \[ V_\pi(0, x_0, y_0) = \min_\pi \mathbb{E} \left[ \begin{pmatrix} x_T \\ y_T \end{pmatrix}^\top Q_T \begin{pmatrix} x_T \\ y_T \end{pmatrix} + \lambda \sum_{t = 0}^{T - 1} \int_{\mathbb{R}^m} \pi_t(u) \ln \pi_t(u) \, du \right] \]
   where $\pi_t$ is the density of the control at time $t$, $m$ is the dimension of the control, and $\lambda$ is the temperature parameter, which weights the trade-off between exploration and exploitation.
2. **Optimal value function**:
   \[ J^*(t, x_t, y_t) = \begin{pmatrix} x_t \\ y_t \end{pmatrix}^\top P_t \begin{pmatrix} x_t \\ y_t \end{pmatrix} + \frac{\lambda}{2} \sum_{k = t}^{T - 1} \ln \left( \left( \frac{1}{\pi \lambda} \right)^m |G_k| \right) \]
   where $P_t = F_t - H_t G^{-1}_t H_t^\top$. Inside the logarithm, $\pi$ denotes the constant $3.14159\ldots$, not the policy.
3. **Density function of the optimal feedback control**:
   \[ \pi^*_t(u) = \mathcal{N} \left( -G^{-1}_t H_t^\top \begin{pmatrix} x_t \\ y_t \end{pmatrix}, \frac{\lambda}{2} G^{-1}_t \right) \]
   That is, the optimal control is Gaussian with a mean that is linear in the state $(x_t, y_t)$ and a covariance $\frac{\lambda}{2} G^{-1}_t$ that scales with the temperature $\lambda$ and does not depend on the state.

These formulas show how the discrete-time LQ control problem is solved with the RL method and then applied to the mean-variance asset-liability management problem in finance.
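The formulas above can be made concrete with a small numerical sketch. The following Python snippet is only an illustration under stated assumptions: it takes the matrices $F_t$, $H_t$ and $G_t$ as given NumPy arrays (the summary does not reproduce how the paper computes them), uses made-up numbers, and the function names `gaussian_policy` and `entropy_term` are hypothetical. It computes the mean gain and covariance of the Gaussian policy, the matrix $P_t = F_t - H_t G_t^{-1} H_t^\top$, and one summand of the logarithmic term in $J^*$, then samples one control.

```python
import numpy as np


def gaussian_policy(F, H, G, lam):
    """Return (K, cov, P) for one time step.

    The policy is N(K @ z, cov) with z = (x_t, y_t), following the
    formulas in the summary:
      mean gain  K   = -G^{-1} H^T
      covariance cov = (lam / 2) G^{-1}
      value      P   = F - H G^{-1} H^T
    F, H, G are assumed to be given (hypothetical inputs here).
    """
    G_inv = np.linalg.inv(G)
    K = -G_inv @ H.T
    cov = 0.5 * lam * G_inv
    P = F - H @ G_inv @ H.T
    return K, cov, P


def entropy_term(G, lam):
    """One summand (lam/2) * ln((1/(pi*lam))^m * |G|) of the value function."""
    m = G.shape[0]
    _, logdet = np.linalg.slogdet(G)  # log|G|, assuming G is positive definite
    return 0.5 * lam * (logdet - m * np.log(np.pi * lam))


# Illustrative numbers only (not taken from the paper).
F = np.array([[2.0, 0.5], [0.5, 1.0]])  # 2 x 2, state z = (x_t, y_t)
H = np.array([[1.0], [0.5]])            # 2 x m with m = 1
G = np.array([[2.0]])                   # m x m, positive definite
lam = 0.1                               # temperature parameter
z = np.array([1.0, -1.0])               # current state (x_t, y_t)

K, cov, P = gaussian_policy(F, H, G, lam)
rng = np.random.default_rng(0)
u = rng.multivariate_normal(K @ z, cov)  # one exploratory control sample
```

In this sketch, letting `lam` approach zero shrinks the covariance to zero, so the sampled control collapses to the deterministic linear feedback `K @ z`; a larger `lam` widens the exploration around that same mean.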

Reinforcement Learning for a Discrete-Time Linear-Quadratic Control Problem with an Application
