Learning Rate Schedules in the Presence of Distribution Shift

Matthew Fahrbach, Adel Javanmard, Vahab Mirrokni, Pratik Worah
2023-08-20
Abstract: We design learning rate schedules that minimize regret for SGD-based online learning in the presence of a changing data distribution. We fully characterize the optimal learning rate schedule for online linear regression via a novel analysis with stochastic differential equations. For general convex loss functions, we propose new learning rate schedules that are robust to distribution shift and we give upper and lower bounds for the regret that only differ by constants. For non-convex loss functions, we define a notion of regret based on the gradient norm of the estimated models and propose a learning schedule that minimizes an upper bound on the total expected regret. Intuitively, one expects changing loss landscapes to require more exploration, and we confirm that optimal learning rate schedules typically increase in the presence of distribution shift. Finally, we provide experiments for high-dimensional regression models and neural networks to illustrate these learning rate schedules and their cumulative regret.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper investigates how to design learning rate schedules that minimize dynamic regret for online learning with stochastic gradient descent (SGD) when the data distribution changes over time. Specifically, it explores the following issues:

1. **Linear Regression**:
   - How to design the optimal learning rate schedule to minimize dynamic regret when the regression coefficients vary over time.
   - Introducing a new stochastic differential equation (SDE) that approximates the dynamics of SGD under distribution shift, and deriving the optimal learning rate by analyzing this equation.
2. **General Convex Loss Functions**:
   - Proposing new learning rate schedules for general convex loss functions that are robust to distribution shift.
   - Providing upper and lower bounds on the dynamic regret and showing that these bounds differ only by constants.
3. **Non-Convex Loss Functions**:
   - Defining a notion of regret based on the gradient norm of the estimated models and proposing a learning rate schedule that minimizes an upper bound on the total expected regret.
   - Validating these learning rate schedules experimentally on high-dimensional regression models and neural networks.

### Main Contributions

1. **Linear Regression**:
   - Proposing a novel stochastic differential equation (SDE) that approximates the dynamics of SGD under distribution shift.
   - Deriving the optimal learning rate schedule and validating it through theoretical analysis and experiments.
2. **General Convex Loss Functions**:
   - Proposing adaptive learning rate schedules based on an analysis of upper and lower bounds on the dynamic regret.
   - Proving that for strongly convex loss functions, the upper and lower bounds have the same form and differ only by constants.
3. **Non-Convex Loss Functions**:
   - Modifying the definition of regret so that performance is measured by the gradient norm of the model.
   - Proposing a learning rate schedule that minimizes the expected cumulative regret and validating it through experiments.

### Experimental Validation

- **High-Dimensional Regression Models**: Comparing the performance of different learning rate schedules on high-dimensional regression models.
- **Medical Applications**: Using dynamic learning rate schedules to classify continuously arriving small RNA data in flow cytometry, validating the effectiveness of the methods.

### Summary

Through theoretical analysis and experimental validation, this paper studies how to design learning rate schedules that minimize dynamic regret when the data distribution changes over time. The schedules cover linear regression, general convex loss functions, and non-convex loss functions, and they offer both theoretical and practical guidance for online learning systems.
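
### Illustrative Sketch

The intuition that distribution shift favors larger learning rates can be checked with a small simulation. The following is a minimal sketch and not the paper's method or experimental setup: it assumes a random-walk drift in the true regression coefficients, isotropic Gaussian features, a warm start, and constant (not optimized) learning rates. The function name `simulate_tracking_error` and all parameter values are hypothetical choices made for illustration. The cumulative squared distance between the SGD iterate and the current true coefficients serves as a simple proxy for dynamic regret.

```python
import numpy as np


def simulate_tracking_error(lr, drift_std, T=2000, d=5, noise_std=0.1, seed=0):
    """Online SGD for linear regression with drifting true coefficients.

    Returns the cumulative squared tracking error, used here as a
    simple proxy for dynamic regret (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)  # true coefficients at time 0
    theta = theta_star.copy()        # warm start (assumed for simplicity)
    total = 0.0
    for _ in range(T):
        # Distribution shift: the true coefficients take a random-walk step.
        theta_star = theta_star + drift_std * rng.normal(size=d)
        # Draw one example from the current distribution.
        x = rng.normal(size=d)
        y = x @ theta_star + noise_std * rng.normal()
        # Measure the error of the current model before updating it.
        total += float(np.sum((theta - theta_star) ** 2))
        # SGD step on the squared loss 0.5 * (x @ theta - y) ** 2.
        grad = (x @ theta - y) * x
        theta = theta - lr * grad
    return total


# Compare a small and a large constant learning rate, with and without drift.
results = {
    (drift_std, lr): simulate_tracking_error(lr, drift_std)
    for drift_std in (0.0, 0.05)
    for lr in (0.001, 0.05)
}

for (drift_std, lr), err in results.items():
    print(f"drift_std={drift_std:<5} lr={lr:<6} cumulative error={err:.3f}")
```

Under these assumptions, the small learning rate accumulates less error when the coefficients are fixed, while the larger one tracks the drifting coefficients far better. This matches the qualitative claim in the abstract; the schedules derived in the paper are time-varying and are not reproduced here.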