Abstract: SIAM Journal on Control and Optimization, Volume 62, Issue 1, Page 135-166, February 2024. This work uses the entropy-regularized relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein, an agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning, but, on the other hand, they introduce bias by assigning a positive probability to nonoptimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularization. We study algorithms resulting from two entropy regularization formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalizes policy divergence between consecutive episodes. We focus on the finite horizon continuous-time linear-quadratic (LQ) RL problem, where a linear dynamics with unknown drift coefficients is controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularization, we prove that the regret, for both learning algorithms, is of the order [math] (up to a logarithmic factor) over [math] episodes, matching the best known result from the literature.
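
To illustrate why an entropy-regularized quadratic cost leads to a Gaussian policy whose bias scales with the regularization strength, here is a minimal sketch for a static, one-dimensional problem. It is not the paper's continuous-time algorithm. It assumes a hypothetical cost q(a) = r (a - a_star)^2 and a regularization weight `lam`; minimizing E[q(a)] + lam E[log pi(a)] over densities pi gives pi proportional to exp(-q(a) / lam), which is Gaussian with mean a_star and variance lam / (2 r). All function and parameter names below are illustrative assumptions.

```python
import random
import statistics


def gibbs_gaussian_policy(a_star, r, lam):
    """Return (mean, std) of the density proportional to exp(-r*(a - a_star)**2 / lam).

    Assumed toy setting: scalar action, quadratic cost with curvature r > 0,
    entropy-regularization weight lam > 0.
    """
    if r <= 0 or lam <= 0:
        raise ValueError("r and lam must be positive")
    variance = lam / (2.0 * r)
    return a_star, variance ** 0.5


def expected_excess_cost(r, lam):
    """Closed-form bias of the noisy policy: E[r*(a - a_star)^2] = r * Var(a) = lam / 2."""
    _, std = gibbs_gaussian_policy(0.0, r, lam)
    return r * std ** 2


def simulated_excess_cost(a_star, r, lam, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the same bias, drawing actions from the Gaussian policy."""
    rng = random.Random(seed)
    mean, std = gibbs_gaussian_policy(a_star, r, lam)
    costs = [r * (rng.gauss(mean, std) - a_star) ** 2 for _ in range(n_samples)]
    return statistics.fmean(costs)
```

In this toy setting the excess cost of exploring is lam / 2 regardless of the curvature r, so shrinking `lam` reduces the bias while also shrinking the exploration variance lam / (2 r). This is the trade-off the abstract describes, in its simplest form.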

Predictable Interval MDPs through Entropy Regularization

Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization

Unpredictable Planning Under Partial Observability

Relaxed Equilibria for Time-Inconsistent Markov Decision Processes

Essentially Sharp Estimates on the Entropy Regularization Error in Discrete Discounted Markov Decision Processes

The Limits of Pure Exploration in POMDPs: When the Observation Entropy is Enough

Data-driven Interval MDP for Robust Control Synthesis

Entropy Maximization for Partially Observable Markov Decision Processes

A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning

Tsallis Entropy Regularization for Linearly Solvable MDP and Linear Quadratic Regulator

A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Entropy-Regularized Stochastic Games

Conformal Prediction Intervals for Markov Decision Process Trajectories

Entropy-regularized Point-based Value Iteration

A Contracting Dynamical System Perspective toward Interval Markov Decision Processes

Transfer Entropy in MDPs with Temporal Logic Specifications

Robust Deterministic Policies for Markov Decision Processes under Budgeted Uncertainty

Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning

State Entropy Optimization in Markov Decision Processes

Accelerating Primal-dual Methods for Regularized Markov Decision Processes

Robust Anytime Learning of Markov Decision Processes