Abstract: Real-world system control requires both high-performing and interpretable controllers. Model-based control policies have gained popularity by using historical data to learn system costs and dynamics before implementation. However, this two-phase approach prevents these policies from achieving optimal control, as the metrics used to train these models (e.g., mean squared errors) often differ from the actual control system cost. In this paper, we present DiffOP, a Differentiable Optimization-based Policy for optimal control. In the proposed framework, control actions are derived by solving an optimization problem, where the control cost function and the system's dynamics can be parameterized as neural networks. Our key technical innovation lies in developing a hybrid optimization algorithm that combines policy gradients with implicit differentiation through the optimization layer, enabling end-to-end training with the actual cost feedback. Under standard regularity conditions, we prove DiffOP converges to stationary points at a rate of $O(1/K)$. Empirically, DiffOP achieves state-of-the-art performance in both nonlinear control tasks and real-world building control.

What problem does this paper attempt to address?

The paper addresses a shortcoming of existing model-based control strategies: they do not optimize the performance of the actual control system. In the traditional two-stage approach, system costs and dynamics are first learned from historical data, and the resulting controller is then deployed. Because the metrics used to train the models (such as mean squared error) often differ from the actual control cost, a model can predict past data accurately and still guide control decisions poorly, leaving real operating costs higher than necessary.

To close this gap, the authors propose DiffOP (Differentiable Optimization-based Policy), an optimal control framework built on differentiable optimization. Its main contributions are:

1. **Jointly learning cost and dynamics models**: By combining policy gradients with implicit differentiation, DiffOP is trained end-to-end in a model-agnostic reinforcement learning (RL) setting, using actual cost feedback directly to optimize the control policy.
2. **Theoretical guarantees**: The authors prove that DiffOP converges to a stationary point at a rate of $O(1/K)$, which they present as the first non-asymptotic convergence-rate and sample-complexity analysis of optimization-based policies in an RL setting.
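To make "implicit differentiation through the optimization layer" concrete, here is a minimal sketch on a toy unconstrained problem. It is not the authors' implementation: the objective `g`, the matrices `A` and `B`, and the helper names are all assumptions chosen for illustration. The idea is that the minimizer $u^\star(\theta)$ satisfies $\nabla_u g(u^\star, \theta) = 0$, so the implicit function theorem gives $\partial u^\star / \partial\theta = -[\nabla^2_{uu} g]^{-1} \nabla^2_{u\theta} g$ without unrolling the solver.

```python
import numpy as np

# Toy inner problem (hypothetical, for illustration only):
#   g(u, theta) = 0.5 * u^T A u + 0.25 * sum(u^4) - u^T B theta
A = np.array([[2.0, 0.3], [0.3, 1.0]])              # positive definite
B = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]])   # couples theta to u

def grad_u(u, theta):
    """Gradient of g with respect to u."""
    return A @ u + u**3 - B @ theta

def hess_uu(u):
    """Hessian of g with respect to u (positive definite for every u)."""
    return A + np.diag(3.0 * u**2)

def solve_inner(theta, iters=50):
    """Find u*(theta) = argmin_u g(u, theta) with plain Newton steps."""
    u = np.zeros(2)
    for _ in range(iters):
        u = u - np.linalg.solve(hess_uu(u), grad_u(u, theta))
    return u

def implicit_jacobian(u_star):
    """du*/dtheta from the implicit function theorem.

    Differentiating grad_u(u*, theta) = 0 in theta gives
    hess_uu(u*) @ du*/dtheta - B = 0.
    """
    return np.linalg.solve(hess_uu(u_star), B)

if __name__ == "__main__":
    theta = np.array([0.5, -1.0, 2.0])
    u_star = solve_inner(theta)
    jac = implicit_jacobian(u_star)

    # Cross-check against central finite differences of the solver output.
    eps = 1e-6
    jac_fd = np.column_stack([
        (solve_inner(theta + eps * e) - solve_inner(theta - eps * e)) / (2 * eps)
        for e in np.eye(3)
    ])
    print("max abs difference:", np.max(np.abs(jac - jac_fd)))
```

The Jacobian costs one linear solve at the solution, regardless of how many iterations the inner solver needed. A constrained problem such as the one DiffOP solves would differentiate the KKT conditions instead of the plain stationarity condition used here.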
### Formula Summary

- **Optimization-based control policy**:
  \[ u^\star_{0:H-1}(x_{\text{init}}; \theta)=\arg\min_{u_{0:H-1}}\sum_{i=0}^{H-1}c(x_i, u_i; \theta_c)+c_H(x_H; \theta_H) \]
  subject to the constraints
  \[ x_0 = x_{\text{init}}, \quad x_{i+1}=f(x_i, u_i; \theta_f), \quad g(x_i, u_i)\leq 0, \]
  where $\theta$ collects the learnable parameters $\theta_c$, $\theta_H$ and $\theta_f$.
- **Policy optimization problem**:
  \[ \min_\theta C(\theta):=\mathbb{E}\left[\sum_{t=0}^{T} c(x_t, u_t; \phi_c)\right] \]
  subject to the constraints
  \[ x_{t+1}=f(x_t, u_t; \phi_f), \quad u_t\sim\pi_\theta(x_t), \]
  where $\phi_c$ and $\phi_f$ parameterize the actual system's cost and dynamics.
- **Policy gradient update rule**:
  \[ \nabla_\theta C(\theta)=\mathbb{E}\left[L(\tau)\sum_{t=0}^{T}\frac{1}{\sigma^2}[\nabla_\theta u^\star_t]^T(u_t - u^\star_t)\right], \quad L(\tau)=\sum_{t=0}^{T} c(x_t, u_t; \phi_c), \]
  where $u_t$ is sampled from a Gaussian policy with mean $u^\star_t$ and variance $\sigma^2$, and $L(\tau)$ is the cumulative cost of the sampled trajectory $\tau$.

Together, these components let DiffOP improve control performance while preserving the interpretability of an optimization-based controller and supporting safety through explicit constraints. This makes it a candidate for complex real-world control tasks such as power systems, industrial infrastructure, transportation networks, and robotic systems.
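The policy gradient rule above is a score-function (likelihood-ratio) estimator for a Gaussian policy centered on the optimizer's output: each sampled action contributes its score $\frac{1}{\sigma^2}[\nabla_\theta u^\star]^T(u - u^\star)$, weighted by the cost actually observed. The following is a minimal one-step sketch, not the paper's algorithm. The linear stand-in `inner_solution` for $u^\star(x;\theta)$, the quadratic `true_cost`, and all numeric values are assumptions picked so that the exact gradient is known in closed form.

```python
import numpy as np

def inner_solution(x, theta):
    """Hypothetical stand-in for u*(x; theta).

    In DiffOP this would be the output of the optimization layer.
    """
    return theta * x

def inner_solution_grad(x, theta):
    """d u*(x; theta) / d theta for the stand-in above."""
    return x

def true_cost(u):
    """Assumed actual system cost, observable only by applying an action."""
    return (u - 3.0) ** 2

def policy_gradient_estimate(theta, x, sigma, n_samples, rng):
    """Monte Carlo score-function estimate of d E[cost] / d theta."""
    u_star = inner_solution(x, theta)
    # Gaussian exploration around the optimizer's action.
    u = u_star + sigma * rng.standard_normal(n_samples)
    # Score of the Gaussian policy: (1 / sigma^2) * grad(u*) * (u - u*).
    score = inner_solution_grad(x, theta) * (u - u_star) / sigma**2
    # Weight each score by the cost observed for that sample.
    return np.mean(true_cost(u) * score)

rng = np.random.default_rng(0)
theta, x, sigma = 1.0, 2.0, 0.5
estimate = policy_gradient_estimate(theta, x, sigma, 200_000, rng)

# Closed form for this toy: E[cost] = (theta * x - 3)^2 + sigma^2.
exact = 2.0 * x * (theta * x - 3.0)
print("estimate:", estimate, "exact:", exact)
```

The estimator never differentiates `true_cost`, which is why the actual cost and dynamics can remain unknown. Only the optimizer's sensitivity $\nabla_\theta u^\star$ is needed, and that can come from implicit differentiation.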

Differentiable Optimization-based Control Policy with Convergence Analysis
