Scaling Laws for Hyperparameter Optimization

Arlind Kadra, Maciej Janowski, Martin Wistuba, Josif Grabocka
2023-10-26
Abstract: Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle the issue of hyperparameter optimization; however, most of these methods do not exploit the dominant power-law nature of learning curves for Bayesian optimization. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and train incrementally by making use of gray-box evaluations. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 59 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors.
Machine Learning
What problem does this paper attempt to address?
The paper addresses Hyperparameter Optimization (HPO) in machine learning, and in particular the high cost of evaluating a large number of configurations for deep learning methods. Its core contribution is a new multi-fidelity HPO method that exploits the power-law shape of learning curves. The main objectives of the paper can be summarized as follows:

1. **Introduction of a power-law-based surrogate model**: The researchers propose "Deep Power Laws (DPL)", an ensemble of neural networks that predicts performance under different hyperparameter configurations, with predictions constrained to follow a power-law scaling pattern. The method uses gray-box evaluations to decide dynamically which configurations to pause and which to train incrementally.
2. **New mechanism combined with Bayesian optimization**: DPL serves as the surrogate model for Bayesian optimization, estimating the performance of a given configuration at future budgets. This lets the method make decisions from partially observed learning curves and allocate resources accordingly.
3. **Empirical comparison**: The paper reports that DPL achieves the best results against seven state-of-the-art competitors on three benchmarks (tabular, image, and natural language processing data). The experiments cover 59 diverse tasks, including advanced deep learning architectures such as Transformer, XFormer, and ResNeXt.

In summary, the paper aims to make hyperparameter optimization for deep learning more efficient by exploiting the power-law behavior of learning curves. According to the reported experiments, DPL obtains the best any-time results across all three benchmarks.
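
To make the idea of extrapolating from partial learning curves concrete, here is a minimal sketch. It is not the authors' DPL implementation: DPL uses a neural-network ensemble inside Bayesian optimization, whereas this sketch fits an independent curve per configuration by least squares and has no uncertainty estimate or acquisition function. The functional form `alpha + beta * budget**(-gamma)`, the helper names, and the synthetic curves are all assumptions made for illustration.

```python
def power_law(budget, alpha, beta, gamma):
    """Assumed power-law form: loss = alpha + beta * budget**(-gamma)."""
    return alpha + beta * budget ** (-gamma)


def fit_power_law(budgets, losses, gammas=None):
    """Fit (alpha, beta, gamma) to an observed partial learning curve.

    For a fixed gamma the model is linear in alpha and beta, so those two
    come from closed-form least squares; gamma is chosen by grid search.
    """
    if gammas is None:
        gammas = [g / 100 for g in range(1, 301)]  # 0.01, 0.02, ..., 3.00
    n = len(budgets)
    mean_y = sum(losses) / n
    best = None
    for gamma in gammas:
        x = [b ** (-gamma) for b in budgets]
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            continue  # degenerate design, cannot identify beta
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, losses))
        beta = sxy / sxx
        alpha = mean_y - beta * mean_x
        sse = sum((alpha + beta * xi - yi) ** 2 for xi, yi in zip(x, losses))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    _, alpha, beta, gamma = best
    return alpha, beta, gamma


def pick_next_config(curves, max_budget):
    """Extrapolate every partial curve to max_budget and return the
    configuration with the lowest predicted final loss."""
    predictions = {}
    for name, losses in curves.items():
        budgets = list(range(1, len(losses) + 1))
        alpha, beta, gamma = fit_power_law(budgets, losses)
        predictions[name] = power_law(max_budget, alpha, beta, gamma)
    return min(predictions, key=predictions.get), predictions


# Hypothetical, noise-free curves observed for only 5 epochs each.
curves = {
    "config_a": [power_law(b, 0.3, 0.5, 1.0) for b in range(1, 6)],
    "config_b": [power_law(b, 0.1, 0.9, 0.5) for b in range(1, 6)],
}

# After 5 epochs config_a has the lower observed loss ...
observed_best = min(curves, key=lambda name: curves[name][-1])
# ... but the extrapolation to 100 epochs favors config_b.
best, predictions = pick_next_config(curves, max_budget=100)
```

In this toy setup, ranking by the last observed loss would continue training `config_a`, while ranking by the extrapolated loss resumes `config_b` and leaves `config_a` paused. That is the kind of decision a gray-box method can make from partially observed curves.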