Abstract: Stochastic gradient descent (SGD) is a powerful optimization technique that is particularly useful in online learning scenarios. Its convergence analysis is relatively well understood under the assumption that the data samples are independent and identically distributed (iid). However, applying SGD to policy optimization problems in operations research involves a distinct challenge: the policy changes the environment and thereby affects the data used to update the policy. The adaptively generated data stream involves samples that are non-stationary, no longer independent from each other, and affected by previous decisions. The influence of previous decisions on the data generated introduces bias in the gradient estimate, which presents a potential source of instability for online learning not present in the iid case. In this paper, we introduce simple criteria for the adaptively generated data stream to guarantee the convergence of SGD. We show that the convergence speed of SGD with adaptive data is largely similar to the classical iid setting, as long as the mixing time of the policy-induced dynamics is factored in. Our Lyapunov-function analysis allows one to translate existing stability analysis of stochastic systems studied in operations research into convergence rates for SGD, and we demonstrate this for queueing and inventory management problems. We also showcase how our result can be applied to study the sample complexity of an actor-critic policy gradient algorithm.
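The setting described in the abstract can be illustrated with a toy sketch. It is not taken from the paper: the AR(1) environment, the quadratic cost, and every name and parameter value below are assumptions chosen only for illustration. The state evolves as x' = rho * x + theta + noise, so the policy parameter theta shifts the stationary mean to theta / (1 - rho). For the long-run cost 0.5 * (x - target)^2, the quantity (x - target) / (1 - rho) is an unbiased gradient estimate only when x is drawn from the stationary distribution under the current theta. Along the adaptive trajectory, x still reflects earlier values of theta, so the estimate is biased and the samples are correlated.

```python
import random


def sgd_with_adaptive_data(rho=0.5, target=2.0, sigma=1.0,
                           eta0=0.1, decay=0.01, steps=20000, seed=0):
    """Toy SGD on a policy parameter theta that drives its own data stream.

    Hypothetical environment (an assumption, not from the paper):
        x_next = rho * x + theta + Gaussian noise with std sigma.
    The minimizer of the long-run cost 0.5 * (x - target)^2 is
        theta_star = target * (1 - rho).
    """
    rng = random.Random(seed)
    theta, x = 0.0, 0.0
    for t in range(steps):
        # Gradient estimate from the current state. It is unbiased only if x
        # were stationary under the current theta; here x carries memory of
        # earlier thetas, so the estimate is biased and serially correlated.
        grad_est = (x - target) / (1.0 - rho)
        # Decaying step size, as in classical stochastic approximation.
        eta = eta0 / (1.0 + decay * t)
        theta -= eta * grad_est
        # The updated policy changes the environment that produces the
        # next sample.
        x = rho * x + theta + rng.gauss(0.0, sigma)
    return theta


if __name__ == "__main__":
    theta_hat = sgd_with_adaptive_data()
    print("estimated theta:", theta_hat)
    print("theta_star     :", 2.0 * (1.0 - 0.5))
```

In this sketch the chain mixes quickly (rho = 0.5), so the iterate should settle near target * (1 - rho). Moving rho closer to 1 lengthens the mixing time and, in line with the abstract's claim, would be expected to slow convergence unless the step sizes are reduced accordingly.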

Low-Cost Lipschitz-Independent Adaptive Importance Sampling of Stochastic Gradients

Multiple importance sampling for stochastic gradient estimation

Lsh-sampling Breaks the Computation Chicken-and-egg Loop in Adaptive Stochastic Gradient Estimation

Gradient-based Sampling: An Adaptive Importance Sampling for Least-squares

Importance Sampling for Stochastic Gradient Descent in Deep Neural Networks

Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm

Accelerating Stochastic Gradient Descent Using Antithetic Sampling

Adaptive Variance Reducing for Stochastic Gradient Descent

Stochastic Gradient Descent with Biased but Consistent Gradient Estimators

Derivative-Free Optimization via Adaptive Sampling Strategies

Gradient Importance Sampling

ADASS: Adaptive Sample Selection for Training Acceleration

Resampling Stochastic Gradient Descent Cheaply for Efficient Uncertainty Quantification

Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent

Stochastic Gradient Descent with Adaptive Data

Shuffling Gradient Descent-Ascent with Variance Reduction for Nonconvex-Strongly Concave Smooth Minimax Problems

Stochastic Approximate Gradient Descent via the Langevin Algorithm

Low-Precision Stochastic Gradient Langevin Dynamics

Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent