Abstract: Real-world system control requires both high-performing and interpretable controllers. Model-based control policies have gained popularity by using historical data to learn system costs and dynamics before implementation. However, this two-phase approach prevents these policies from achieving optimal control, as the metrics used to train these models (e.g., mean squared error) often differ from the actual control system cost. In this paper, we present DiffOP, a Differentiable Optimization-based Policy for optimal control. In the proposed framework, control actions are derived by solving an optimization problem, where the control cost function and the system's dynamics can be parameterized as neural networks. Our key technical innovation lies in developing a hybrid optimization algorithm that combines policy gradients with implicit differentiation through the optimization layer, enabling end-to-end training with the actual cost feedback. Under standard regularity conditions, we prove DiffOP converges to stationary points at a rate of $O(1/K)$. Empirically, DiffOP achieves state-of-the-art performance in both nonlinear control tasks and real-world building control.
What problem does this paper attempt to address?
The paper addresses a shortcoming of existing model-based control policies: they do not optimize the performance of the actual control system. Specifically, the traditional two-stage method (first learn the system costs and dynamics from historical data, then deploy the controller) cannot achieve optimal control, because the metrics used to train the models (such as mean squared error) often differ from the actual control system cost. As a result, a model can predict past data accurately and still perform poorly at guiding control decisions and minimizing actual operating costs.
To address this, the authors propose DiffOP (Differentiable Optimization-based Policy), an optimal control framework based on differentiable optimization. Its main innovations are:
1. **Jointly learning cost and dynamics models**: By combining policy gradients with implicit differentiation, DiffOP can be trained end-to-end in a model-agnostic reinforcement learning (RL) setting, using the actual cost feedback directly to optimize the control policy.
2. **Theoretical guarantees**: The authors prove that DiffOP converges to a stationary point at a rate of \(O(1/K)\). They present this as the first analysis of the non-asymptotic convergence rate and sample complexity of optimization-based policies in a reinforcement learning setting.
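The combination of policy gradients and implicit differentiation can be illustrated on a scalar toy problem. The sketch below is not the paper's algorithm: the inner objective `0.5*(u - theta*x)**2 + 0.25*u**4`, the cost `true_cost`, the Gaussian exploration noise `sigma`, and all function names are assumptions chosen so that the result can be checked in closed form. It differentiates the inner minimizer with respect to \(\theta\) via the implicit function theorem, then uses that derivative inside a score-function (REINFORCE-style) estimate of the gradient of the expected true cost.

```python
import numpy as np

def solve_u(theta, x, iters=50):
    """Inner problem: u*(theta) = argmin_u 0.5*(u - theta*x)**2 + 0.25*u**4 (Newton's method)."""
    u = 0.0
    for _ in range(iters):
        grad = (u - theta * x) + u**3      # dJ/du
        hess = 1.0 + 3.0 * u**2            # d2J/du2 > 0, so J is strictly convex
        u -= grad / hess
    return u

def du_dtheta(theta, x):
    """Implicit differentiation of dJ/du = 0: du*/dtheta = -(d2J/du2)^-1 * d2J/(du dtheta)."""
    u = solve_u(theta, x)
    return x / (1.0 + 3.0 * u**2)          # the cross term d2J/(du dtheta) equals -x

def true_cost(u):
    """Hypothetical cost reported by the real system; the inner objective never sees it."""
    return (u - 1.0) ** 2

def policy_gradient(theta, x=1.5, sigma=0.5, n_samples=200_000, seed=0):
    """Score-function estimate of d/dtheta E[true_cost(u)] with u ~ N(u*(theta), sigma^2)."""
    rng = np.random.default_rng(seed)
    u_star = solve_u(theta, x)
    u = u_star + sigma * rng.standard_normal(n_samples)
    # Gradient of the Gaussian log-density with respect to theta, via the chain rule.
    score = du_dtheta(theta, x) * (u - u_star) / sigma**2
    return float(np.mean(true_cost(u) * score))
```

For this toy cost the exact gradient is `2 * (u_star - 1) * du_dtheta(theta, x)`, so `policy_gradient(theta)` should agree with it up to Monte Carlo error. The estimator only needs sampled costs, which is what lets the parameters be trained on actual cost feedback rather than on a prediction loss.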
### Formula Summary
- **Optimization-based control policy**:
\[
u^\star_{0:H - 1}(x_{\text{init}}; \theta)=\arg\min_{u}\sum_{i = 0}^{H - 1}c(x_i, u_i; \theta_c)+c_H(x_H; \theta_H)
\]
Subject to the constraints:
\[
x_0 = x_{\text{init}}, \quad x_{i + 1}=f(x_i, u_i; \theta_f), \quad g(x_i, u_i)\leq0
\]
- **Policy optimization problem**:
\[
\min_\theta C(\theta):=\mathbb{E}\left[\sum_{t = 0}^T c(x_t, u_t; \phi_c)\right]
\]
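Subject to the true system dynamics and the stochastic policy, where \(\phi_c\) and \(\phi_f\) denote the actual cost and dynamics of the system (as opposed to the learned parameters \(\theta\)):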
\[
x_{t + 1}=f(x_t, u_t; \phi_f), \quad u_t\sim\pi_\theta(x_t)
\]
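- **Policy gradient**:
\[
\nabla_\theta C(\theta)=\mathbb{E}\left[\left(\sum_{t = 0}^T c(x_t, u_t; \phi_c)\right)\sum_{t = 0}^T\frac{1}{\sigma^2}[\nabla_\theta u^\star_t]^\top(u_t - u^\star_t)\right]
\]
Here the policy is taken to be Gaussian around the optimizer's output, \(u_t\sim\mathcal{N}(u^\star_t, \sigma^2 I)\), so that \(\frac{1}{\sigma^2}[\nabla_\theta u^\star_t]^\top(u_t - u^\star_t)\) is the gradient of the log-likelihood of \(u_t\). This term is weighted by the observed trajectory cost; without that weight the expectation would be zero. The Jacobian \(\nabla_\theta u^\star_t\) is obtained by implicit differentiation through the optimization problem.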
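The first two formulas above can be sketched for a scalar linear-quadratic case. This is a minimal illustration under stated assumptions, not the paper's setup: the constants `A_TRUE`, `B`, `Q`, `R`, `Q_H`, the function names, and the choice of a single learnable parameter `theta` (the model's dynamics coefficient) are all hypothetical, and the inequality constraint \(g(x_i, u_i)\leq 0\) is dropped so that the plan has a closed-form solution. A constrained or neural-network version would need a numerical solver instead.

```python
import numpy as np

A_TRUE, B, Q, R, Q_H = 0.9, 0.5, 1.0, 0.1, 1.0   # hypothetical scalar system and costs

def plan_actions(x_init, theta, H=5):
    """Solve the H-step problem for the model x' = theta*x + B*u (no inequality constraints)."""
    # Condensed form: x_{1:H} = F * x_init + G @ u
    F = np.array([theta ** (i + 1) for i in range(H)])
    G = np.array([[theta ** (i - j) * B if j <= i else 0.0 for j in range(H)]
                  for i in range(H)])
    W = np.diag([Q] * (H - 1) + [Q_H])   # weights on x_1..x_{H-1} and on the terminal x_H
    # Stationarity of sum(Q x^2 + R u^2) + Q_H x_H^2 gives a linear system in u.
    return -np.linalg.solve(G.T @ W @ G + R * np.eye(H), G.T @ W @ F * x_init)

def planned_cost(u, x_init, theta):
    """Model-predicted cost of an action sequence, used to check optimality of the plan."""
    x, cost = x_init, 0.0
    for u_i in u:
        cost += Q * x**2 + R * u_i**2
        x = theta * x + B * u_i
    return cost + Q_H * x**2

def estimate_cost(theta, T=10, sigma=0.1, n_rollouts=2000, seed=0):
    """Monte Carlo estimate of C(theta) on the true system under a Gaussian policy."""
    rng = np.random.default_rng(seed)
    gain = plan_actions(1.0, theta)[0]   # the plan is linear in x_init for this model
    x = np.ones(n_rollouts)              # every rollout starts at x_0 = 1
    total = np.zeros(n_rollouts)
    for _ in range(T + 1):               # t = 0, ..., T
        # Apply the first planned action plus exploration noise.
        u = gain * x + sigma * rng.standard_normal(n_rollouts)
        total += Q * x**2 + R * u**2     # true stage cost
        x = A_TRUE * x + B * u           # true dynamics
    return float(total.mean())
```

Comparing `estimate_cost(0.9)` with `estimate_cost(0.0)` shows the dependence that the policy optimization problem exploits: the true cost \(C(\theta)\) changes with the model parameter used inside the planner, so \(\theta\) can be tuned against the cost actually incurred rather than against prediction error.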
Together, these components allow DiffOP to improve control performance while preserving the interpretability and safety of an optimization-based controller. This makes it suitable for complex real-world control tasks such as power systems, industrial infrastructure, transportation networks, and robotic systems.