Risk-Sensitive Stochastic Optimal Control as Rao-Blackwellized Markovian Score Climbing

Hany Abdulsamad, Sahel Iqbal, Adrien Corenflos, Simo Särkkä
2023-12-22
Abstract: Stochastic optimal control of dynamical systems is a crucial challenge in sequential decision-making. Recently, control-as-inference approaches have had considerable success, providing a viable risk-sensitive framework to address the exploration-exploitation dilemma. Nonetheless, a majority of these techniques only invoke the inference-control duality to derive a modified risk objective that is then addressed within a reinforcement learning framework. This paper introduces a novel perspective by framing risk-sensitive stochastic control as Markovian score climbing under samples drawn from a conditional particle filter. Our approach, while purely inference-centric, provides asymptotically unbiased estimates for gradient-based policy optimization with optimal importance weighting and no explicit value function learning. To validate our methodology, we apply it to the task of learning neural non-Gaussian feedback policies, showcasing its efficacy on numerical benchmarks of stochastic dynamical systems.
Machine Learning, Systems and Control
What problem does this paper attempt to address?
The core problem this paper addresses is **stochastic optimal control**, specifically risk-sensitive stochastic optimal control in nonlinear, non-Gaussian, and bounded domains. The authors frame risk-sensitive stochastic control as Markovian score climbing, driven by samples drawn from a conditional particle filter.

### Detailed Interpretation

1. **Problem Background**:
   - **Stochastic Optimal Control**: Low-level decision-making under uncertainty is a key challenge, with applications ranging from chemical plant control to autonomous driving.
   - **Existing Methods**: Current methods fall into two groups: those relying on analytical modeling and local approximation, and data-driven methods based on stochastic optimization. The latter have become increasingly important in recent years because of their success on complex problems.
2. **Limitations of Existing Methods**:
   - Most existing techniques invoke the inference-control duality only to derive a modified risk objective, which is then optimized within a reinforcement learning framework.
   - These methods usually rely on Gaussian approximations and heuristics, which leads to poor performance in highly nonlinear, non-Gaussian, and bounded domains.
3. **Advantages of the New Method**:
   - **Purely Inference-Centric Method**: The proposed method stays entirely within the inference framework and provides asymptotically unbiased gradient estimates for gradient-based policy optimization, with no explicit value function learning.
   - **Avoiding Bias**: Using a Rao-Blackwellized Markov chain, the method estimates the gradient of the log marginal likelihood in an asymptotically unbiased way, avoiding the bias of traditional methods.
4. **Application Scenarios**:
   - The authors validate the method on numerical benchmarks of stochastic dynamical systems, where it is used to learn neural non-Gaussian feedback policies.

### Mathematical Formulas

- **Expected Total Cost Criterion**:
  \[ J(\pi)=\mathbb{E}\left[\sum_{t=0}^{T}c(x_t,u_t)\right] \]
  where the expectation \(\mathbb{E}\) is taken under the state-action trajectory distribution \(p(x_{0:T},u_{0:T})\).
- **Joint Density**:
  \[ p(x_{0:T},u_{0:T},y_{0:T}|\theta)=\delta(x_0)\prod_{t=0}^{T-1}f(x_{t+1}|x_t,u_t)\prod_{t=0}^{T}\pi_\theta(u_t|x_t)\prod_{t=0}^{T}g(y_t|x_t,u_t) \]
- **Log-Marginal Likelihood Maximization**:
  \[ \arg\max_{\theta\in\Theta}\log p(y_{0:T}=1|\theta)=\arg\min_{\theta\in\Theta}-\frac{1}{\eta}\log\mathbb{E}_{p_\theta}\left[\exp\left(-\eta\sum_{t=0}^{T}c(x_t,u_t)\right)\right] \]

### Conclusion

This paper addresses the limitations of existing methods in nonlinear, non-Gaussian, and bounded domains by recasting risk-sensitive stochastic control as sample-based Markovian score climbing. The method provides asymptotically unbiased gradient estimates, is suited to learning neural non-Gaussian feedback policies, and demonstrates its efficacy on numerical benchmarks of stochastic dynamical systems.
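
To make the risk-sensitive objective above concrete, the following is a minimal sketch of how \(-\frac{1}{\eta}\log\mathbb{E}_{p_\theta}\left[\exp\left(-\eta\sum_{t}c(x_t,u_t)\right)\right]\) could be estimated by plain Monte Carlo. It is not the paper's algorithm: there is no conditional particle filter and no score climbing here, only an evaluation of the objective for one fixed policy. The scalar dynamics, the linear-Gaussian policy, the quadratic cost, and the names `rollout_costs` and `risk_sensitive_objective` are all hypothetical choices made for illustration. The equality in the formula is read under the assumption \(\eta>0\) and an optimality likelihood \(g(y_t=1|x_t,u_t)=\exp(-\eta\,c(x_t,u_t))\), which this summary does not state explicitly.

```python
import numpy as np


def rollout_costs(theta, n_traj=2000, horizon=20, seed=0):
    """Total cost per trajectory for a toy 1-D system (hypothetical, not from the paper).

    Dynamics: x_{t+1} = x_t + u_t + noise; policy: u_t ~ N(theta * x_t, 0.1^2);
    cost: c(x, u) = x^2 + 0.1 * u^2, summed over t = 0..horizon.
    """
    rng = np.random.default_rng(seed)
    x = np.ones(n_traj)                # deterministic initial state, mirroring delta(x_0)
    total = np.zeros(n_traj)
    for _ in range(horizon + 1):
        u = theta * x + 0.1 * rng.standard_normal(n_traj)   # sample from the policy
        total += x**2 + 0.1 * u**2                          # accumulate stage cost
        x = x + u + 0.1 * rng.standard_normal(n_traj)       # sample the transition
    return total


def risk_sensitive_objective(costs, eta):
    """Monte Carlo estimate of -(1/eta) * log E[exp(-eta * total_cost)].

    The maximum exponent is subtracted before exponentiating (log-sum-exp trick)
    to avoid underflow when eta * cost is large.
    """
    z = -eta * costs
    m = z.max()
    log_mean_exp = m + np.log(np.mean(np.exp(z - m)))
    return -log_mean_exp / eta


costs = rollout_costs(theta=-0.5)
print("expected total cost:", costs.mean())
for eta in (1e-6, 0.1, 1.0):
    print(f"eta={eta}: risk-sensitive objective = {risk_sensitive_objective(costs, eta)}")
```

Under these assumptions, the estimate approaches the expected total cost \(J(\pi)\) as \(\eta\to 0\), and by Jensen's inequality it never exceeds the sample mean of the total cost for \(\eta>0\).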