Abstract: Online control with non-stochastic disturbances and adversarially chosen convex cost functions, referred to as online non-stochastic control, has recently attracted increasing attention. We study online non-stochastic control with partial feedback, where the learner can access only partially observed states and partially informed (bandit) costs. This setting arises naturally in real-world decision-making applications and strictly generalizes special cases that previous works studied separately. We propose the first online algorithm for this problem, which attains $\tilde{O}(T^{3/4})$ regret against the best policy in hindsight, where $T$ denotes the time horizon and the $\tilde{O}(\cdot)$-notation omits poly-logarithmic factors in $T$. To further enhance the algorithm's robustness to changing environments, we then design a novel method with a two-layer structure to optimize the dynamic regret, a more challenging measure that competes with time-varying policies. Our method is based on the online ensemble framework and treats the controller above as the base learner. On top of that, we design two different meta-combiners to simultaneously handle the unknown variation of environments and the memory issue arising from online control. We prove that the two resulting algorithms enjoy $\tilde{O}(T^{3/4}(1+P_T)^{1/2})$ and $\tilde{O}(T^{3/4}(1+P_T)^{1/4}+T^{5/6})$ dynamic regret, respectively, where $P_T$ measures the environmental non-stationarity. Our results are further extended to unknown transition matrices. Finally, empirical studies on both synthetic linear and simulated nonlinear tasks validate our method's effectiveness, supporting the theoretical findings.

Online Policy Optimization in Unknown Nonlinear Systems

Online Control of Unknown Time-Varying Dynamical Systems

Online Non-stochastic Control with Partial Feedback

Online Off-Policy Reinforcement Learning for Optimal Control of Unknown Nonlinear Systems Using Neural Networks

Regret Analysis of Policy Optimization over Submanifolds for Linearly Constrained Online LQG

Online Stackelberg Optimization via Nonlinear Control

Online Policy Optimization for Robust MDP

Learning to Control under Time-Varying Environment

Adaptive Optimal Control for a Class of Continuous-Time Affine Nonlinear Systems with Unknown Internal Dynamics

Non-stationary Online Learning with Memory and Non-stochastic Control

Improved Policy Optimization for Online Imitation Learning

Dynamic Regret of Policy Optimization in Non-stationary Environments

Online Adaptive Optimization Algorithm for Semi-Markov Control Processes

Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL

Learning Based Control Policy and Regret Analysis for Online Quadratic Optimization with Asymmetric Information Structure

Policy Optimization Adaptive Dynamic Programming for Optimal Control of Input-Affine Discrete-Time Nonlinear Systems

Online Synchronous Approximate Optimal Learning Algorithm for Multi-Player Non-Zero-Sum Games with Unknown Dynamics

Offline Model-Based Optimization via Policy-Guided Gradient Search

Online Policy Iterative-Based H∞ Optimization Algorithm for a Class of Nonlinear Systems

Policy Gradient Reinforcement Learning for Parameterized Continuous-Time Optimal Control

Online Adaptive Optimal Control for Continuous-Time Nonlinear Systems with Completely Unknown Dynamics