Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization

Thomas Nagler, Lennart Schneider, Bernd Bischl, Matthias Feurer
2024-05-24
Abstract: Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model's generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout become competitive with standard CV while being computationally cheaper.
Machine Learning
What problem does this paper attempt to address?
This paper asks whether reshuffling the resampling splits can improve the generalization performance of the model selected by hyperparameter optimization (HPO). HPO typically evaluates every configuration on the same fixed train-validation split or the same fixed cross-validation folds, and selects the configuration with the lowest estimated generalization error. The paper shows that drawing a new split for each configuration often improves the final model's performance on unseen data. On the theoretical side, it analyzes how reshuffling affects the asymptotic behavior of the validation loss surface and derives a bound on the expected regret in the limiting regime. This bound ties the potential benefit of reshuffling to the signal and noise characteristics of the optimization problem.
A controlled simulation study confirms these theoretical results, and a large-scale, realistic HPO experiment demonstrates the practical usefulness of reshuffling. With reshuffling, test performance is competitive with that of fixed splits overall. The gain is largest for a single train-validation holdout split, where reshuffling often makes holdout competitive with standard 5-fold cross-validation at lower computational cost, whereas its effect on 5-fold cross-validation itself is typically small. The paper also discusses how reshuffling relates to overfitting to the validation data and how it affects optimizers such as random search and Bayesian optimization (BO). In short, reshuffling the splits during HPO is a simple but little-known technique that can improve the generalization of the selected model, especially when the loss surface is flat and the estimation noise is large.
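To make the difference between the two protocols concrete, here is a minimal sketch of random search with a holdout estimate, once with a fixed split and once with a split redrawn for every configuration. It is an illustration under stated assumptions, not the authors' experimental code: the toy data, the one-dimensional ridge model, the search range for `lam`, and every function name (`make_data`, `holdout_split`, `fit_ridge_1d`, `validation_mse`, `random_search`) are hypothetical and chosen only to keep the example self-contained.

```python
import random


def make_data(n, rng):
    """Toy 1-D regression data, y = 2x + Gaussian noise (assumed for illustration)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def holdout_split(n, valid_frac, rng):
    """Draw one random train-validation split and return (train_idx, valid_idx)."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_valid = max(1, int(round(valid_frac * n)))
    return idx[n_valid:], idx[:n_valid]


def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge slope without intercept: sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def validation_mse(xs, ys, lam, train_idx, valid_idx):
    """Fit on the training indices and return the mean squared error on the validation indices."""
    w = fit_ridge_1d([xs[i] for i in train_idx], [ys[i] for i in train_idx], lam)
    return sum((ys[i] - w * xs[i]) ** 2 for i in valid_idx) / len(valid_idx)


def random_search(xs, ys, n_configs, reshuffle, seed=0, valid_frac=0.2):
    """Random search over lam; returns (best_lam, best_validation_loss).

    reshuffle=False: every configuration is scored on the same holdout split.
    reshuffle=True:  a new holdout split is drawn for each configuration.
    """
    config_rng = random.Random(seed)     # draws hyperparameter configurations
    split_rng = random.Random(seed + 1)  # draws resampling splits
    fixed = holdout_split(len(xs), valid_frac, split_rng)
    best_lam, best_loss = None, float("inf")
    for _ in range(n_configs):
        lam = 10.0 ** config_rng.uniform(-3.0, 2.0)  # log-uniform in [1e-3, 1e2]
        if reshuffle:
            train_idx, valid_idx = holdout_split(len(xs), valid_frac, split_rng)
        else:
            train_idx, valid_idx = fixed
        loss = validation_mse(xs, ys, lam, train_idx, valid_idx)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss


if __name__ == "__main__":
    xs, ys = make_data(200, random.Random(42))
    print("fixed split:     ", random_search(xs, ys, 50, reshuffle=False))
    print("reshuffled split:", random_search(xs, ys, 50, reshuffle=True))
```

Because the configurations come from a separate random stream, both modes evaluate the same candidate values of `lam` and differ only in the splits used to score them. The sketch only shows where reshuffling enters the loop; a single toy run like this says nothing about which protocol generalizes better, which would require comparing the selected configurations on independent test data over many repetitions.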