A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning

Minyoung Kim, Timothy M. Hospedales

2024-10-14

Abstract: We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning, including hyperparameter optimization, loss function learning, few-shot learning, invariance learning and more. These problems are often formalized as Bi-Level Optimization (BLO). We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution. To solve this stochastic optimization, we adopt Stochastic Gradient Langevin Dynamics (SGLD) MCMC to sample from the inner distribution, and propose a recurrent algorithm to compute the MC-estimated hypergradient. Our derivation is similar to forward-mode differentiation, but we introduce a new first-order approximation that makes it feasible for large models without needing to store huge Jacobian matrices. The main benefits are two-fold: i) Our stochastic formulation takes uncertainty into account, which makes the method robust to suboptimal inner optimization and to non-unique inner minima caused by overparametrization; ii) Compared to existing methods that often exhibit unstable behavior and hyperparameter sensitivity in practice, our method leads to considerably more reliable solutions. We demonstrate that the new approach achieves promising results on diverse meta learning problems and easily scales to learning 87M hyperparameters in the case of Vision Transformers.
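
One way to write down the formulation described in the abstract is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $\lambda$ denotes the outer (hyper)parameters, $\theta$ the inner parameters, $L_{\mathrm{in}}$ and $L_{\mathrm{out}}$ the inner and outer losses, $\epsilon$ the SGLD step size, and $\widehat{\nabla}$ a minibatch gradient estimate. The Gibbs form of the inner distribution is a common choice that may differ from the paper's exact definition.

```latex
\begin{aligned}
\text{Deterministic BLO:}\quad
  & \min_{\lambda}\; L_{\mathrm{out}}\bigl(\theta^{*}(\lambda), \lambda\bigr)
    \quad \text{s.t.}\quad
    \theta^{*}(\lambda) \in \arg\min_{\theta} L_{\mathrm{in}}(\theta, \lambda) \\
\text{Stochastic version:}\quad
  & \min_{\lambda}\; \mathbb{E}_{\theta \sim p(\theta \mid \lambda)}
    \bigl[ L_{\mathrm{out}}(\theta, \lambda) \bigr],
    \qquad
    p(\theta \mid \lambda) \propto \exp\bigl(-L_{\mathrm{in}}(\theta, \lambda)\bigr) \\
\text{SGLD step:}\quad
  & \theta_{t+1} = \theta_{t}
    - \epsilon\, \widehat{\nabla}_{\theta} L_{\mathrm{in}}(\theta_{t}, \lambda)
    + \sqrt{2\epsilon}\;\xi_{t},
    \qquad \xi_{t} \sim \mathcal{N}(0, I)
\end{aligned}
```

Under this reading, the hypergradient is the gradient of the expectation with respect to $\lambda$, which the abstract says is estimated by Monte Carlo over SGLD samples.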

Machine Learning

What problem does this paper attempt to address?

This paper addresses the differentiable meta learning problem, which is prevalent in modern deep learning and is typically formalized as Bi-Level Optimization (BLO). It focuses on BLO problems such as hyperparameter optimization, loss function learning, few-shot learning, and invariance learning. Existing BLO methods face two main challenges on these tasks:

1. **Uncertainty in inner optimization**: In practice the inner problem is solved with minibatch SGD for a limited number of iterations, so it may not fully converge. In addition, for non-convex neural networks the inner problem may have multiple local minima, which adds to the uncertainty of the solution.
2. **Instability of existing methods**: Methods based on the Implicit Function Theorem (IFT) are more memory-efficient in theory, but in practice they depend on the quality of the inner optimization, which leads to unstable behavior and sensitivity to hyperparameters.

To address these challenges, the paper proposes a new stochastic method for computing hypergradients in BLO. The authors recast the standard deterministic BLO problem as a stochastic optimization problem in which the inner loss defines a smooth probability distribution and the outer objective is the expected outer loss under that distribution. This formulation accounts for the uncertainty in the inner optimization, which improves the robustness and reliability of the algorithm.
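
The idea of sampling an inner distribution and averaging the outer loss over the samples can be illustrated with a minimal NumPy sketch. Everything in it is an assumption chosen for illustration and is not the authors' implementation: the toy ridge-regression setup, the Gibbs form `p(theta | lam) ∝ exp(-L_in(theta, lam))`, and the names `sgld_sample`, `inner_grad`, `outer_loss`, and `lam`. It only shows SGLD sampling and the Monte Carlo estimate of the expected outer loss; the paper's recurrent hypergradient computation is not reproduced here.

```python
import numpy as np

def sgld_sample(grad_fn, theta0, step, n_steps, burn_in, rng):
    """Approximate samples from p(theta) ∝ exp(-L_in(theta)) via SGLD.

    grad_fn(theta, rng) must return a minibatch estimate of grad L_in(theta).
    """
    theta = np.asarray(theta0, dtype=float).copy()
    kept = []
    for t in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        # Langevin update: gradient step plus Gaussian noise of variance 2 * step.
        theta = theta - step * grad_fn(theta, rng) + np.sqrt(2.0 * step) * noise
        if t >= burn_in:
            kept.append(theta.copy())
    return np.stack(kept)

# Toy data (hypothetical): linear regression with a train/validation split.
rng = np.random.default_rng(0)
n_train, n_val, dim, batch = 200, 100, 3, 20
theta_true = np.array([1.0, -2.0, 0.5])
X_tr = rng.standard_normal((n_train, dim))
y_tr = X_tr @ theta_true + 0.1 * rng.standard_normal(n_train)
X_val = rng.standard_normal((n_val, dim))
y_val = X_val @ theta_true + 0.1 * rng.standard_normal(n_val)

lam = 1.0  # outer variable: ridge penalty strength (assumed hyperparameter)

def inner_grad(theta, rng):
    """Minibatch gradient of L_in = 0.5*||X theta - y||^2 + 0.5*lam*||theta||^2."""
    idx = rng.choice(n_train, size=batch, replace=False)
    resid = X_tr[idx] @ theta - y_tr[idx]
    # Rescale the minibatch data term so it estimates the full-data gradient.
    return (n_train / batch) * (X_tr[idx].T @ resid) + lam * theta

def outer_loss(theta):
    """Outer (validation) loss evaluated at a single inner sample."""
    return 0.5 * np.mean((X_val @ theta - y_val) ** 2)

samples = sgld_sample(inner_grad, np.zeros(dim), step=1e-3,
                      n_steps=3000, burn_in=1000, rng=rng)

# Monte Carlo estimate of the expected outer loss under the inner distribution.
expected_outer = float(np.mean([outer_loss(th) for th in samples]))

# Deterministic counterpart: the unique minimizer of the (convex) inner loss.
theta_ridge = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(dim), X_tr.T @ y_tr)
```

In this convex toy problem the inner distribution is a Gaussian centered at `theta_ridge`, so the sample mean should land near the deterministic solution, while `expected_outer` also reflects the spread of the samples. In the non-convex settings the summary describes, the samples could instead cover several inner minima, which is the situation the stochastic formulation is meant to handle.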
