SIAM Journal on Control and Optimization, Volume 62, Issue 1, Pages 135-166, February 2024.

Abstract: This work uses the entropy-regularized relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. In this framework, an agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. On the one hand, the noisy policies explore the space and hence facilitate learning; on the other hand, they introduce bias by assigning positive probability to nonoptimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularization. We study algorithms resulting from two entropy regularization formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalizes policy divergence between consecutive episodes. We focus on the finite horizon continuous-time linear-quadratic (LQ) RL problem, where linear dynamics with unknown drift coefficients are controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularization, we prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
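To make the idea of executing a Gaussian relaxed policy with noise that is independent across time more concrete, the following is a minimal sketch and not the paper's algorithm. It simulates one episode of a scalar LQ problem on a time grid, drawing a fresh, independent Gaussian action at every grid point. The function name `run_episode`, the feedback gain `k`, the exploration variance `var`, the coefficient names, the Euler discretization, and the terminal cost are all assumptions introduced for illustration; in particular, the link between `var` and the entropy regularization strength is not modeled here.

```python
import numpy as np


def run_episode(k, var, a_coef=-0.5, b_coef=1.0, sigma=0.2, q=1.0, r=1.0,
                x0=1.0, horizon=1.0, n_steps=100, rng=None):
    """Simulate one episode of a scalar LQ problem under a Gaussian policy.

    Hypothetical setup: dX = (a_coef * X + b_coef * action) dt + sigma dW,
    running cost q * X^2 + r * action^2, terminal cost q * X_T^2.
    The action at each grid point is sampled from N(k * X, var).
    """
    rng = np.random.default_rng() if rng is None else rng
    dt = horizon / n_steps
    x = x0
    cost = 0.0
    path = np.empty(n_steps + 1)
    path[0] = x
    for n in range(n_steps):
        # Fresh draw at every grid point: execution noise is independent across time.
        action = k * x + np.sqrt(var) * rng.standard_normal()
        # Left-endpoint approximation of the running cost integral.
        cost += (q * x**2 + r * action**2) * dt
        # Euler-Maruyama step; the action is held constant over [t_n, t_{n+1}).
        x = x + (a_coef * x + b_coef * action) * dt \
            + sigma * np.sqrt(dt) * rng.standard_normal()
        path[n + 1] = x
    cost += q * x**2  # assumed terminal cost
    return path, cost
```

Calling `run_episode(k=-1.0, var=0.1, rng=np.random.default_rng(0))` returns the simulated state path and the realized cost of one episode. In this sketch, `n_steps` plays the role of the sampling frequency and `var` the role of the exploration level, the two quantities the abstract says must be tuned.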

Exploratory mean-variance portfolio selection with Choquet regularizers

Continuous‐time mean–variance portfolio selection: A reinforcement learning framework

Reinforcement Learning for Continuous-Time Mean-Variance Portfolio Selection in a Regime-Switching Market

Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning

Learning Merton's Strategies in an Incomplete Market: Recursive Entropy Regularization and Biased Gaussian Exploration

Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty

MQES: Max-Q Entropy Search for Efficient Exploration in Continuous Reinforcement Learning

Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning

Discrete-Time Mean-Variance Strategy Based on Reinforcement Learning

EVOLvE: Evaluating and Optimizing LLMs For Exploration

Promoting Stochasticity for Expressive Policies Via a Simple and Efficient Regularization Method

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

A new deep reinforcement learning model for dynamic portfolio optimization

Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning

Continuous-Time Path-Dependent Exploratory Mean-Variance Portfolio Construction

Mean-Variance Efficient Reinforcement Learning with Applications to Dynamic Financial Investment

Exploration by Maximizing Renyi Entropy for Reward-Free RL Framework

Reinforcement Learning for a Discrete-Time Linear-Quadratic Control Problem with an Application

Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration

Uncertainty-Aware Reinforcement Learning for Portfolio Optimization

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo