Abstract: Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle hyperparameter optimization; however, most of them do not exploit the dominant power-law nature of learning curves for Bayesian optimization. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and which to train incrementally by making use of gray-box evaluations. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 59 diverse tasks. Our method achieves the best any-time results across all benchmarks compared to all competitors.

What problem does this paper attempt to address?

The paper addresses Hyperparameter Optimization (HPO) in machine learning, and in particular the high cost of evaluating a large number of configurations for deep learning methods. Its core contribution is a new multi-fidelity hyperparameter optimization method that exploits the power-law scaling pattern of learning curves. The main objectives of the paper can be summarized as follows:

1. **Introduction of a power-law-based surrogate model**: The researchers propose the "Deep Power Laws (DPL)" model, an ensemble of neural networks that predicts performance under different hyperparameter configurations, with predictions constrained to follow a power-law scaling pattern. The method uses gray-box evaluations to dynamically decide which configurations to pause and which to train incrementally.
2. **New mechanism combined with Bayesian optimization**: The DPL model serves as the surrogate model for Bayesian optimization and estimates the performance of a given configuration at future budgets. This allows the method to make decisions from partially observed learning curves and so allocate resources effectively.
3. **Empirical superiority demonstration**: The paper reports that DPL achieves the best results compared to seven state-of-the-art competitors on three benchmarks covering tabular, image, and natural language processing datasets. The experiments span 59 diverse tasks and include advanced deep learning architectures such as Transformer, XFormer, and ResNeXt.

In summary, the main problem the paper attempts to solve is improving the efficiency and feasibility of hyperparameter optimization in deep learning by leveraging the power-law characteristics of learning curves. The summary presents the proposed DPL method as both novel in its formulation and effective in the reported experiments.
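The idea of extrapolating a partially observed learning curve with a power law and using the prediction to decide which configuration to continue can be illustrated with a small sketch. This is not the paper's DPL implementation: it assumes the common three-parameter form `alpha + beta * budget ** (-gamma)`, replaces the neural network ensemble with a plain least-squares fit per configuration, and replaces the Bayesian optimization acquisition step with a greedy choice. The names `power_law`, `fit_power_law`, `pick_next`, and the example curves are hypothetical.

```python
def power_law(budget, alpha, beta, gamma):
    """Assumed curve form: error = alpha + beta * budget ** (-gamma)."""
    return alpha + beta * budget ** (-gamma)


def fit_power_law(budgets, errors, gammas=None):
    """Fit (alpha, beta, gamma) to observed points.

    Grid-searches gamma; for each candidate, alpha and beta follow from
    ordinary least squares because the model is linear in budget ** (-gamma).
    Needs at least three points for gamma to be identifiable.
    """
    if gammas is None:
        gammas = [0.05 * k for k in range(1, 61)]  # candidates 0.05 .. 3.0
    n = len(budgets)
    mean_y = sum(errors) / n
    best = None
    for gamma in gammas:
        xs = [b ** (-gamma) for b in budgets]
        mean_x = sum(xs) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        if var_x == 0:
            continue  # all budgets identical, slope undefined
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, errors))
        beta = cov / var_x
        alpha = mean_y - beta * mean_x
        sse = sum(
            (power_law(b, alpha, beta, gamma) - y) ** 2
            for b, y in zip(budgets, errors)
        )
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    if best is None:
        raise ValueError("need at least two distinct budgets")
    return best[1], best[2], best[3]


def pick_next(curves, max_budget):
    """Greedy stand-in for the acquisition step: extrapolate every partial
    curve to max_budget and return the configuration with the lowest
    predicted error, plus all predictions."""
    preds = {}
    for name, errors in curves.items():
        budgets = list(range(1, len(errors) + 1))  # epochs 1..len(errors)
        alpha, beta, gamma = fit_power_law(budgets, errors)
        preds[name] = power_law(max_budget, alpha, beta, gamma)
    return min(preds, key=preds.get), preds


# Synthetic partial curves (3 epochs each), generated from known power laws.
# config_b has the lower error at every observed epoch, but config_a
# has the lower asymptote.
curves = {
    "config_a": [power_law(b, 0.10, 0.5, 0.5) for b in (1, 2, 3)],
    "config_b": [power_law(b, 0.30, 0.2, 1.0) for b in (1, 2, 3)],
}
best, preds = pick_next(curves, max_budget=50)
print(best, preds)
```

In this toy setup, a rule based only on the latest observed error would favor `config_b`, whereas the extrapolated curves favor `config_a`. That is the kind of decision a power-law surrogate is meant to enable from partial observations.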

Scaling Laws for Hyperparameter Optimization

Power-law Scaling to Assist with Key Challenges in Artificial Intelligence

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training

Optimization Hyper-parameter Laws for Large Language Models

A Hitchhiker's Guide to Scaling Law Estimation

Scaling Exponents Across Parameterizations and Optimizers

Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra

Hyperparameter Optimization for Machine Learning Models Based on Bayesian Optimization

Broken Neural Scaling Laws

Explaining Neural Scaling Laws

Navigating Scaling Laws: Compute Optimality in Adaptive Model Training

A Dynamical Model of Neural Scaling Laws

Dynamic and Efficient Gray-Box Hyperparameter Optimization for Deep Learning

Scaling Laws for Transfer

A Solvable Model of Neural Scaling Laws

Bayesian Hyperparameter Optimization with BoTorch, GPyTorch and Ax

A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization

Scalable Nested Optimization for Deep Learning

Efficient Hyper-parameter Optimization for NLP Applications

Scaling Laws for Neural Language Models