Abstract: Stochastic optimal control of dynamical systems is a crucial challenge in sequential decision-making. Recently, control-as-inference approaches have had considerable success, providing a viable risk-sensitive framework to address the exploration-exploitation dilemma. Nonetheless, a majority of these techniques only invoke the inference-control duality to derive a modified risk objective that is then addressed within a reinforcement learning framework. This paper introduces a novel perspective by framing risk-sensitive stochastic control as Markovian score climbing under samples drawn from a conditional particle filter. Our approach, while purely inference-centric, provides asymptotically unbiased estimates for gradient-based policy optimization with optimal importance weighting and no explicit value function learning. To validate our methodology, we apply it to the task of learning neural non-Gaussian feedback policies, showcasing its efficacy on numerical benchmarks of stochastic dynamical systems.

What problem does this paper attempt to address?

The core problem that this paper attempts to solve is the **stochastic optimal control problem**, especially the risk-sensitive stochastic optimal control problem in nonlinear, non-Gaussian and bounded domains. Specifically, the author proposes a new method that frames the risk-sensitive stochastic control problem as sample-based Markovian score climbing, with these samples drawn from a conditional particle filter.

### Detailed Interpretation

1. **Problem Background**:
   - **Stochastic Optimal Control**: Low-level decision-making under uncertainty is a key challenge, with applications ranging from chemical plant control to autonomous driving.
   - **Existing Methods**: Current methods can be divided into those relying on analytical modeling and local approximation, and data-driven stochastic optimization-type methods. The latter has become increasingly important in recent years due to its success in handling complex problems.
2. **Limitations of Existing Methods**:
   - Most existing techniques only utilize the inference-control duality to derive modified risk objectives and solve these problems within the reinforcement learning framework.
   - These methods usually involve Gaussian approximation and heuristic methods, resulting in poor performance in highly nonlinear, non-Gaussian and bounded domains.
3. **Advantages of the New Method**:
   - **Pure Inference-Centered Method**: The method proposed by the author is entirely based on the inference framework, providing asymptotically unbiased gradient estimates for gradient-based policy optimization without explicit value function learning.
   - **Avoiding Bias**: By using the Rao-Blackwellized Markov chain, this method can generate unbiased marginal likelihood estimates, thus avoiding the bias problem in traditional methods.
4. **Application Scenarios**:
   - The author verifies the effectiveness of this method through numerical benchmark tests, especially demonstrating its superiority in learning neural non-Gaussian feedback policies.

### Mathematical Formulas

- **Expected Total Cost Criterion**:
  \[ J(\pi)=\mathbb{E}\left[\sum_{t=0}^{T}c(x_t,u_t)\right] \]
  where \(\mathbb{E}\) is the expectation calculated under the state-action trajectory distribution \(p(x_{0:T},u_{0:T})\).
- **Joint Density**:
  \[ p(x_{0:T},u_{0:T},y_{0:T}|\theta)=\delta(x_0)\prod_{t=0}^{T-1}f(x_{t+1}|x_t,u_t)\prod_{t=0}^{T}\pi_\theta(u_t|x_t)\prod_{t=0}^{T}g(y_t|x_t,u_t) \]
- **Log-Marginal Likelihood Maximization**:
  \[ \arg\max_{\theta\in\Theta}\log p(y_{0:T}=1|\theta)=\arg\min_{\theta\in\Theta}-\frac{1}{\eta}\log\mathbb{E}_{p_\theta}\left[\exp\left(-\eta\sum_{t=0}^{T}c(x_t,u_t)\right)\right] \]

### Conclusion

This paper proposes a new method that addresses the limitations of existing methods in nonlinear, non-Gaussian and bounded domains by transforming the risk-sensitive stochastic control problem into sample-based Markovian score climbing. This method provides asymptotically unbiased gradient estimates, is suitable for learning neural non-Gaussian feedback policies, and shows superior performance in multiple benchmark tests.
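To make the risk-sensitive objective \(-\frac{1}{\eta}\log\mathbb{E}_{p_\theta}\left[\exp\left(-\eta\sum_{t=0}^{T}c(x_t,u_t)\right)\right]\) concrete, the following is a minimal Python sketch that estimates it by plain Monte Carlo and compares it with the expected total cost \(J(\pi)\). It is not the paper's method: no conditional particle filter or score climbing is involved. The 1-D linear dynamics, the quadratic cost, the noisy linear feedback policy with gain `k`, and the function names `simulate_costs` and `risk_sensitive_cost` are all assumptions chosen for illustration.

```python
import math
import random


def simulate_costs(k, n_traj=2000, horizon=20, seed=0):
    """Total cost of each rollout on a toy 1-D system (illustrative assumption).

    Hypothetical policy: u_t = -k * x_t + noise. Hypothetical cost: x^2 + 0.1 * u^2.
    """
    rng = random.Random(seed)
    costs = []
    for _ in range(n_traj):
        x, total = 1.0, 0.0                      # fixed initial state x_0 = 1
        for _ in range(horizon + 1):             # t = 0, ..., T
            u = -k * x + 0.1 * rng.gauss(0.0, 1.0)              # u_t ~ pi_theta(. | x_t)
            total += x * x + 0.1 * u * u                        # c(x_t, u_t)
            x = 0.9 * x + 0.5 * u + 0.1 * rng.gauss(0.0, 1.0)   # x_{t+1} ~ f(. | x_t, u_t)
        costs.append(total)
    return costs


def risk_sensitive_cost(costs, eta):
    """Monte Carlo estimate of -(1/eta) * log E[exp(-eta * total_cost)]."""
    z = [-eta * c for c in costs]
    m = max(z)                                   # shift by the max for numerical stability
    log_mean_exp = m + math.log(sum(math.exp(v - m) for v in z) / len(z))
    return -log_mean_exp / eta


costs = simulate_costs(k=0.8)
print("expected total cost:", sum(costs) / len(costs))
for eta in (1e-6, 0.1, 1.0):
    print("eta =", eta, "risk-sensitive cost:", risk_sensitive_cost(costs, eta))
```

With this sign convention and \(\eta > 0\), Jensen's inequality implies the estimate never exceeds the sample mean of the total costs, and it approaches that mean as \(\eta \to 0\).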

Risk-Sensitive Stochastic Optimal Control as Rao-Blackwellized Markovian Score Climbing

Risk-Averse Control of Markov Systems with Value Function Learning

On Average Risk-sensitive Markov Control Processes

Risk Aware Minimum Principle for Optimal Control of Stochastic Differential Equations

Stochastic Optimal Control as Approximate Input Inference

Risk-sensitive Markov control processes

Near Optimal Control for a Class of Stochastic Hybrid Systems.

Linear Quadratic Control with Risk Constraints

Constrained stochastic optimal control with learned importance sampling: A path integral approach

A Neural Network Approach for Stochastic Optimal Control

A Multilevel Approach for Stochastic Nonlinear Optimal Control

Stochastic Data-Driven Predictive Control: Chance-Constraint Satisfaction with Identified Multi-step Predictors

Estimation and Control Using Sampling-Based Bayesian Reinforcement Learning

Risk-averse risk-constrained optimal control

Safe Non-Stochastic Control of Control-Affine Systems: An Online Convex Optimization Approach

Empirical risk minimization for risk-neutral composite optimal control with applications to bang-bang control

Risk‐sensitive maximum principle for stochastic optimal control of mean‐field type Markov regime‐switching jump‐diffusion systems

Connecting Stochastic Optimal Control and Reinforcement Learning

RAT iLQR: A Risk Auto-Tuning Controller to Optimally Account for Stochastic Model Mismatch

Approximate Policy Iteration for Robust Stochastic Control of Multi-agent Markov Decision Processes

Safe Optimal Control Using Stochastic Barrier Functions and Deep Forward-Backward SDEs