A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning

Minyoung Kim, Timothy M. Hospedales
2024-10-14
Abstract: We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning, including hyperparameter optimization, loss function learning, few-shot learning, invariance learning and more. These problems are often formalized as Bi-Level optimizations (BLO). We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution. To solve this stochastic optimization, we adopt Stochastic Gradient Langevin Dynamics (SGLD) MCMC to sample the inner distribution, and propose a recurrent algorithm to compute the MC-estimated hypergradient. Our derivation is similar to forward-mode differentiation, but we introduce a new first-order approximation that makes it feasible for large models without needing to store huge Jacobian matrices. The main benefits are two-fold: i) Our stochastic formulation takes into account uncertainty, which makes the method robust to suboptimal inner optimization or non-unique multiple inner minima due to overparametrization; ii) Compared to existing methods that often exhibit unstable behavior and hyperparameter sensitivity in practice, our method leads to considerably more reliable solutions. We demonstrate that the new approach achieves promising results on diverse meta learning problems and easily scales to learning 87M hyperparameters in the case of Vision Transformers.
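To make the reformulation in the abstract concrete, one possible way to write it down is sketched here. The notation is an assumption and is not taken from the paper: $\lambda$ for the outer (hyper)parameters, $\theta$ for the inner parameters, $L_{\mathrm{in}}$ and $L_{\mathrm{out}}$ for the inner and outer losses, a Gibbs form with temperature $\tau$ for the inner distribution, and step size $\eta$ for the Langevin update. The paper's exact definitions may differ.

```latex
% Standard deterministic BLO (assumed notation)
\min_{\lambda} \; L_{\mathrm{out}}\bigl(\theta^{*}(\lambda), \lambda\bigr)
\quad \text{s.t.} \quad
\theta^{*}(\lambda) \in \arg\min_{\theta} L_{\mathrm{in}}(\theta, \lambda)

% Stochastic counterpart: the inner loss defines a smooth distribution
% (Gibbs form assumed here), and the outer loss becomes an expectation
p(\theta \mid \lambda) \propto \exp\bigl(-L_{\mathrm{in}}(\theta, \lambda) / \tau\bigr),
\qquad
\min_{\lambda} \; \mathbb{E}_{\theta \sim p(\theta \mid \lambda)}
\bigl[ L_{\mathrm{out}}(\theta, \lambda) \bigr]

% Langevin-type update whose stationary distribution approximates p(theta | lambda)
\theta_{t+1} = \theta_{t} - \eta \, \nabla_{\theta} L_{\mathrm{in}}(\theta_{t}, \lambda)
+ \sqrt{2 \eta \tau} \, \epsilon_{t},
\qquad \epsilon_{t} \sim \mathcal{N}(0, I)
```

Under this reading, letting $\tau \to 0$ concentrates $p(\theta \mid \lambda)$ on the inner minimizers, so the deterministic problem is recovered as a limiting case.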
Machine Learning
What problem does this paper attempt to address?
This paper addresses differentiable meta learning, which is prevalent in modern deep learning and is typically formalized as Bi-Level Optimization (BLO). It focuses on BLO problems such as hyperparameter optimization, loss function learning, few-shot learning, and invariance learning. Existing BLO methods face two main challenges on these tasks:

1. **Uncertainty in inner optimization**: In practice the inner problem is solved with minibatch SGD for a limited number of iterations, so it may not fully converge. For non-convex neural networks the inner problem may also have multiple local minima, which adds further uncertainty about the solution.
2. **Instability of existing methods**: Methods based on the Implicit Function Theorem (IFT) are more memory-efficient in theory, but in practice their behavior depends on the quality of the inner optimization, which makes them unstable and sensitive to hyperparameters.

To address these challenges, the paper proposes a new stochastic gradient method to compute hypergradients in BLO. The authors recast the standard deterministic BLO problem as a stochastic optimization problem in which the inner loss defines a smooth probability distribution and the outer objective is the expected outer loss under that distribution. This formulation accounts for the uncertainty in inner optimization, which improves the robustness and reliability of the algorithm.
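The idea of sampling an inner distribution and averaging the outer loss over the samples can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it only estimates the expected outer loss and does not compute the hypergradient. The 1-D quadratic losses, the Gibbs form with temperature `tau`, and all function names are hypothetical choices made for illustration. It also uses the exact inner gradient, so it is the Langevin update without the minibatch noise that SGLD would add.

```python
import numpy as np

def inner_grad(theta, lam, a=2.0):
    # Gradient of the assumed toy inner loss
    # L_in(theta, lam) = 0.5 * (theta - a)**2 + 0.5 * lam * theta**2
    return (theta - a) + lam * theta

def outer_loss(theta, b=1.0):
    # Assumed toy outer loss L_out(theta) = 0.5 * (theta - b)**2
    return 0.5 * (theta - b) ** 2

def langevin_samples(lam, tau=0.1, step=0.01, n_steps=20000, burn_in=2000, seed=0):
    """Draw approximate samples from p(theta | lam) ∝ exp(-L_in(theta, lam) / tau)."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    samples = []
    for t in range(n_steps):
        noise = rng.standard_normal()
        # Gradient step on the inner loss plus Gaussian noise scaled by sqrt(2 * step * tau)
        theta = theta - step * inner_grad(theta, lam) + np.sqrt(2.0 * step * tau) * noise
        if t >= burn_in:
            samples.append(theta)
    return np.array(samples)

def expected_outer_loss(lam, **kwargs):
    # Monte Carlo estimate of E_{theta ~ p(theta | lam)}[L_out(theta)]
    samples = langevin_samples(lam, **kwargs)
    return outer_loss(samples).mean()

if __name__ == "__main__":
    for lam in (0.0, 1.0, 3.0):
        print(f"lam={lam:.1f}  estimated expected outer loss={expected_outer_loss(lam):.4f}")
```

For this toy inner loss the target distribution is Gaussian with mean `a / (1 + lam)` and variance `tau / (1 + lam)`, which gives a closed-form value to check the Monte Carlo estimate against. The expected outer loss includes a variance term in addition to the loss at the inner minimizer, and this is the sense in which the stochastic objective reflects uncertainty about the inner solution.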