Safe Online Convex Optimization with Multi-Point Feedback

Spencer Hutchinson, Mahnoosh Alizadeh
2024-07-16
Abstract: Motivated by the stringent safety requirements that are often present in real-world applications, we study a safe online convex optimization setting where the player needs to simultaneously achieve sublinear regret and zero constraint violation while only using zero-order information. In particular, we consider a multi-point feedback setting, where the player chooses $d + 1$ points in each round (where $d$ is the problem dimension) and then receives the value of the constraint function and cost function at each of these points. To address this problem, we propose an algorithm that leverages forward-difference gradient estimation as well as optimistic and pessimistic action sets to achieve $\mathcal{O}(d \sqrt{T})$ regret and zero constraint violation under the assumption that the constraint function is smooth and strongly convex. We then perform a numerical study to investigate the impacts of the unknown constraint and zero-order feedback on empirical performance.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
The paper asks how to achieve sublinear regret and zero constraint violation simultaneously in safe online convex optimization (OCO) when the constraint is unknown and only zero-order information is available. Specifically, the authors consider a multi-point feedback setting in which the player selects \(d + 1\) points (where \(d\) is the problem dimension) in each round and then receives the values of the constraint function and the cost function at each point. The goal is to minimize the cumulative loss while ensuring that every selected point satisfies the constraint.

### Problem Background

Online convex optimization (OCO) is a sequential decision-making framework in which, in each round \(t \in [T]\), the player selects a vector action \(x_t\) and then observes the loss function \(f_t\). The goal is to minimize the cumulative loss \(\sum_{t = 1}^T f_t(x_t)\). OCO has been studied in many practical applications, such as online advertising, network resource allocation, and power systems. However, many real-world applications have strict safety requirements. For example, in clinical trials and power systems, the constraints may be unknown and yet must never be violated.

### Specific Problem

The paper asks how to design an algorithm that guarantees sublinear regret and zero constraint violation in the multi-point feedback setting when the player receives only zero-order information, that is, function values rather than gradients. In each round the player selects exactly \(d + 1\) actions and then observes the values of the cost function and the constraint function at each of them. Despite this limited information, every selected point must satisfy the constraint, so the player has to balance constraint satisfaction against regret minimization.
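
To illustrate how \(d + 1\) zero-order evaluations per round can stand in for a gradient, here is a minimal Python sketch of a forward-difference estimator, the kind of estimator the abstract names. The function `forward_difference_gradient`, the quadratic `cost`, and the chosen values of `x` and `delta` are hypothetical and used only for illustration; the paper's exact estimator and parameter choices may differ.

```python
import numpy as np


def forward_difference_gradient(f, x, delta):
    """Estimate grad f(x) from d + 1 evaluations: f(x) and f(x + delta * e_i).

    Illustrative sketch only; not taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    d = x.size
    base = f(x)  # one evaluation at the base point
    grad = np.empty(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0  # i-th standard basis vector
        # one extra evaluation per coordinate, d in total
        grad[i] = (f(x + delta * e_i) - base) / delta
    return grad


def cost(x):
    """Hypothetical smooth convex cost used only for this example."""
    return float(np.sum((x - 1.0) ** 2))


x = np.array([0.5, -0.25, 2.0])
estimate = forward_difference_gradient(cost, x, delta=1e-4)
exact = 2.0 * (x - 1.0)  # closed-form gradient of the hypothetical cost
```

For this quadratic, each coordinate of the estimate exceeds the exact gradient by `delta` in exact arithmetic, which shows the bias that the perturbation parameter controls. In the safe setting described above, the base point and all \(d\) perturbed points would also have to satisfy the constraint.
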
### Solution

To solve this problem, the authors propose the algorithm MP-ROGD (Multi-Point Restrained Online Gradient Descent). It combines online convex optimization under multi-point feedback (Agarwal et al., 2010) with the concept of optimistic and pessimistic action sets (Hutchinson and Alizadeh, 2024). The main steps are as follows:

1. **Gradient Estimation**: Since the player has no direct access to gradient information, the algorithm estimates gradients by the forward-difference method.
2. **Optimistic and Pessimistic Action Sets**: The algorithm maintains two sets, the optimistic action set \(Y^o_t\) and the pessimistic action set \(Y^p_t\), which serve as outer and inner approximations of the feasible set, respectively.
3. **Update Rule**: The algorithm computes the optimistic iterate \(\tilde{x}_{t + 1}\) by a gradient descent step on the optimistic action set, and then moves the actual iterate \(x_{t + 1}\) towards the optimistic iterate while keeping it within the pessimistic action set.

### Theoretical Guarantee

The authors prove that, when the constraint function is smooth and strongly convex, MP-ROGD achieves \(O(d\sqrt{T})\) regret and never violates the constraint. Specifically, with appropriately chosen step size \(\eta\) and perturbation parameter \(\delta\), all selected actions are guaranteed to lie within the feasible set.

### Experimental Verification

To evaluate the empirical performance of MP-ROGD, the authors conduct numerical experiments and compare it with existing benchmark algorithms. The results show that, when the constraint is unknown, MP-ROGD performs significantly worse than algorithms with complete constraint information, and that with zero-order feedback it also performs worse than algorithms with first-order feedback.
This indicates that achieving safe OCO with less information comes at a cost.

### Conclusion

The paper studies safe online convex optimization with multi-point zero-order feedback under an unknown constraint and proposes the MP-ROGD algorithm for it. Future research directions may include exploring whether safe OCO can be achieved with less constraint information and applying the algorithm to related problems such as distributed online optimization or online control.
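
The update rule described under Solution can be made concrete with a toy Python sketch. Everything here is an assumption made for illustration: the optimistic and pessimistic sets are replaced by origin-centred Euclidean balls of radius `r_opt` and `r_pes`, the gradient step is taken from the current played point, and the helper names (`project_onto_ball`, `largest_safe_step`, `restrained_step`) are hypothetical. The paper builds its sets from estimated constraint information, which this sketch does not attempt to reproduce.

```python
import numpy as np


def project_onto_ball(y, radius):
    """Euclidean projection onto the origin-centred ball of the given radius."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)


def largest_safe_step(x, target, in_pessimistic, iters=50):
    """Largest gamma in [0, 1] with x + gamma * (target - x) in the pessimistic set.

    Assumes x is already in the set and the set is convex, so membership
    along the segment is monotone and bisection applies.
    """
    if in_pessimistic(target):
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if in_pessimistic(x + mid * (target - x)):
            lo = mid  # still inside: try a longer step
        else:
            hi = mid  # outside: shorten the step
    return lo


def restrained_step(x, grad, eta, r_opt, r_pes):
    """One illustrative update: optimistic projection, then a restrained move."""
    # Optimistic iterate: gradient step projected onto the (stand-in) optimistic set.
    x_tilde = project_onto_ball(x - eta * grad, r_opt)
    # Move towards it only as far as the (stand-in) pessimistic set allows.
    gamma = largest_safe_step(x, x_tilde, lambda y: np.linalg.norm(y) <= r_pes)
    return x + gamma * (x_tilde - x), x_tilde, gamma


x = np.array([0.5, 0.0])
grad = np.array([-4.0, 0.0])  # stand-in for an estimated gradient
x_next, x_tilde, gamma = restrained_step(x, grad, eta=0.5, r_opt=2.0, r_pes=1.0)
```

In this example the optimistic iterate lands on the boundary of the larger ball, while the played point stops at the boundary of the smaller one, so the action stays inside the conservative set even though the descent direction points beyond it.
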