Abstract: We study continuous-time mean–variance portfolio selection in markets where stock prices are diffusion processes driven by observable factors that are also diffusion processes, yet the coefficients of these processes are unknown. Based on the recently developed reinforcement learning (RL) theory for diffusion processes, we present a general data-driven RL algorithm that learns the pre-committed investment strategy directly without attempting to learn or estimate the market coefficients. For multi-stock Black–Scholes markets without factors, we further devise a baseline algorithm and prove its performance guarantee by deriving a sublinear regret bound in terms of Sharpe ratio. For performance enhancement and practical implementation, we modify the baseline algorithm into four variants, and carry out an extensive empirical study to compare their performance, in terms of a host of common metrics, with a large number of widely used portfolio allocation strategies on S&P 500 constituents. The results demonstrate that the continuous-time RL strategies are consistently among the best, especially in a volatile bear market, and decisively outperform the model-based continuous-time counterparts by significant margins.

What problem does this paper attempt to address?

The paper asks how reinforcement learning (RL) can be used to select a mean-variance (MV) efficient portfolio in a continuously traded market. Specifically, it considers an investor who observes stock prices and market factors but does not know the model parameters of the market. It shows how such an investor can learn the optimal investment strategy directly from data with an RL algorithm, without attempting to estimate or learn the market coefficients.

### Core Problems of the Paper

1. **Dynamic Mean-Variance Portfolio Selection**: Traditionally, the mean-variance framework is mainly used for static (single-period) portfolio selection, and applying these static strategies in a dynamic environment is inefficient. This paper studies how to perform dynamic mean-variance portfolio selection in continuous time.
2. **Unknown Market Coefficients**: Most existing methods rely on accurate estimation of asset return moments, especially the expected return, which is very difficult and error-prone in practice. This paper proposes a method that does not require estimating market coefficients.
3. **Application of Reinforcement Learning**: The paper uses reinforcement learning theory, especially the continuous-time RL theory for diffusion processes, to learn the optimal investment strategy directly from data, thus avoiding the estimation error and sensitivity problems of traditional methods.

### Main Contributions

1. **Proposing RL Algorithms**: Based on the continuous-time RL theories of Wang et al. (2020) and Jia and Zhou (2022a, b), RL algorithms applicable to the MV problem are proposed. These algorithms generate new learning data by solving moment conditions under certain martingale conditions.
2. **Theoretical Guarantees and Regret Analysis**: For the multi-dimensional Black-Scholes environment (without factors), a baseline algorithm is designed, and its convergence and sublinear regret bound with respect to the Sharpe ratio are proven. This is the first model-free regret analysis for continuous-time MV portfolio selection.
3. **Empirical Research**: Through extensive empirical research on the S&P 500 constituent stocks, the performance of the proposed RL strategy is compared with that of 15 other popular methods. The results show that the RL strategy outperforms the classic model-driven continuous-time methods on all metrics, especially in volatile and bear markets.

### Formula Examples

Some of the key formulas involved in the paper include:

- Wealth change equation:
  \[ dx(u(t)) = \sum_{i=1}^{d} u_i(t)\frac{dS_i(t)}{S_i(t)} - e^\top u(t)\frac{dS_0(t)}{S_0(t)} \]
- Mean-variance optimization problem:
  \[ \min_u \operatorname{Var}(x(u(T))) \]
  \[ \text{subject to } E[x(u(T))] = z \]
- Entropy-regularized objective function:
  \[ E\left[-\left(x^{\pi}(T) - w\right)^2 + \gamma\int_0^T \log\pi\left(u^{\pi}(t) \mid t, x^{\pi}(t), F(t)\right)dt\right] - (w - z)^2 \]

Through these methods, the paper aims to provide a more robust and effective solution for dynamic portfolio selection, especially when the market coefficients are unknown.
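To make the wealth change equation listed above concrete, the following is a minimal sketch of its Euler discretization on a time grid: each step adds the strategy-weighted stock returns and subtracts the total stock position times the return on the asset priced by S_0. The function name `simulate_wealth`, the callable `strategy(t, x)`, and all parameter values in the toy data (drifts, volatilities, rate, independent stocks) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def simulate_wealth(S, S0, strategy, x0, dt):
    """Euler scheme for dx = sum_i u_i dS_i/S_i - (e^T u) dS_0/S_0.

    S        : array (n_steps + 1, d), stock prices on the time grid
    S0       : array (n_steps + 1,), prices of the asset S_0 on the same grid
    strategy : hypothetical callable (t, x) -> length-d array of positions u(t)
    x0       : initial wealth
    dt       : grid spacing, used only to pass the time t to the strategy
    """
    S = np.asarray(S, dtype=float)
    S0 = np.asarray(S0, dtype=float)
    n_steps = S.shape[0] - 1
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        u = np.asarray(strategy(k * dt, x[k]), dtype=float)
        stock_ret = (S[k + 1] - S[k]) / S[k]      # discrete dS_i / S_i
        rf_ret = (S0[k + 1] - S0[k]) / S0[k]      # discrete dS_0 / S_0
        x[k + 1] = x[k] + u @ stock_ret - u.sum() * rf_ret
    return x


# Toy data (all numbers are made up): two independent geometric Brownian
# motions for the stocks and a deterministic exponential for S_0.
rng = np.random.default_rng(0)
dt, n_steps, d = 1 / 252, 252, 2
mu = np.array([0.08, 0.05])
sigma = np.array([0.20, 0.15])
r = 0.02
noise = rng.standard_normal((n_steps, d))
log_ret = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * noise
S = np.vstack([np.ones(d), np.exp(np.cumsum(log_ret, axis=0))])
S0 = np.exp(r * dt * np.arange(n_steps + 1))

# A constant position in each stock, chosen only for illustration.
wealth = simulate_wealth(S, S0, lambda t, x: np.array([0.5, 0.5]), x0=1.0, dt=dt)
print("terminal wealth:", wealth[-1])
```

The sketch only needs observed price paths, not the drift or volatility that generated them, which matches the data-driven setting described above. The strategy here is fixed rather than learned, so it illustrates the wealth dynamics only and not the RL algorithm itself.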

Mean–Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms, Regret Analysis, and Empirical Study
