Be aware of overfitting by hyperparameter optimization!

Igor V. Tetko, Ruud van Deursen, Guillaume Godin
2024-07-30
Abstract: Hyperparameter optimization is very frequently employed in machine learning. However, an optimization of a large space of parameters could result in overfitting of models. In recent studies on solubility prediction, the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting when using the same statistical measures. Similar results could be calculated using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of SMILES called Transformer CNN. We show that across all analyzed sets using exactly the same protocol, Transformer CNN provided better results than graph-based methods for 26 out of 28 pairwise comparisons while using only a tiny fraction of the time required by the other methods. Last but not least, we stressed the importance of comparing calculation results using exactly the same statistical measures.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address the following issues:

1. **The necessity of hyperparameter optimization**: The paper explores whether hyperparameter optimization in machine learning truly leads to significant improvements in model performance. The authors found through experiments that in some cases, using preset hyperparameters (i.e., hyperparameters that have not been optimized) can achieve similar or even better results. This may be due to over-optimization leading to model overfitting.
2. **The demand for computational resources**: Hyperparameter optimization typically requires a large amount of computational resources, especially when dealing with large-scale datasets. The authors demonstrate how to achieve good model performance without hyperparameter optimization using relatively limited computational resources (such as ordinary clusters in an academic environment), thereby significantly reducing computational costs.
3. **Comparison of different methods**: The paper compares the performance of graph-based methods (such as Attentive FingerPrint and ChemProp) with natural language processing-based methods (such as Transformer CNN) in predicting water solubility. The results show that Transformer CNN provides higher accuracy in most cases and requires much less computation time than other methods.
4. **The impact of data cleaning and organization**: The paper analyzes the impact of different data cleaning and organization methods on model performance. The authors found that even after data cleaning and organization, models with preset hyperparameters can still perform comparably to those with optimized hyperparameters.
5. **Consistency of statistical metrics**: The paper emphasizes the importance of using the same statistical metrics when comparing the performance of different models. The authors point out the differences between traditional RMSE and custom cuRMSE, and discuss the impact of these differences on model performance evaluation.
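The point about consistent metrics can be illustrated with a small sketch. The definition of cuRMSE is not given in this summary, so the code below does not reproduce it; `weighted_rmse` is a hypothetical weighted variant, and the data and weights are invented for illustration. It only shows that two different error measures computed on the same predictions yield different numbers, so scores reported under different measures are not directly comparable.

```python
import math

def rmse(y_true, y_pred):
    """Plain root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def weighted_rmse(y_true, y_pred, weights):
    """Hypothetical weighted RMSE (an assumption, NOT the paper's cuRMSE)."""
    num = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return math.sqrt(num / sum(weights))

# Invented toy data: one large error on a record given a low weight.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
weights = [1.0, 1.0, 1.0, 0.25]

plain = rmse(y_true, y_pred)
weighted = weighted_rmse(y_true, y_pred, weights)
print(f"RMSE = {plain:.3f}, weighted RMSE = {weighted:.3f}")
```

In this toy case the same predictions look noticeably better under the weighted measure, which is why a model scored with one measure should not be ranked against a model scored with the other.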
In summary, the main purpose of this paper is to examine the practical value of hyperparameter optimization in machine learning and to show how models can be trained efficiently with limited resources, while also showing that Transformer CNN is an accurate and efficient method for predicting water solubility.
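One general mechanism by which hyperparameter optimization can overfit is selection bias: when many configurations are scored on the same finite evaluation set and the best score is reported, that score is optimistically biased. The following is a minimal simulation of that mechanism, not a reproduction of the paper's experiments. It assumes every configuration has the same true error (`true_rmse`), so any apparent gain from selection is pure noise; all names and parameter values are illustrative.

```python
import math
import random
import statistics

def sample_rmse(rng, n_test, true_rmse):
    """RMSE measured on n_test records whose errors are drawn from N(0, true_rmse)."""
    errs = [rng.gauss(0.0, true_rmse) for _ in range(n_test)]
    return math.sqrt(sum(e * e for e in errs) / n_test)

def simulate(n_configs, n_test=50, true_rmse=1.0, n_repeats=100, seed=0):
    """Return (mean selected score, mean score of the winner on fresh data)."""
    rng = random.Random(seed)
    selected, fresh = [], []
    for _ in range(n_repeats):
        # Score every configuration on the same-sized evaluation set and keep the best.
        best = min(sample_rmse(rng, n_test, true_rmse) for _ in range(n_configs))
        selected.append(best)
        # Re-evaluate the winner on new data; its true error is unchanged.
        fresh.append(sample_rmse(rng, n_test, true_rmse))
    return statistics.mean(selected), statistics.mean(fresh)

for k in (1, 10, 100):
    sel, new = simulate(k)
    print(f"configs={k:3d}  selected RMSE={sel:.3f}  fresh-data RMSE={new:.3f}")
```

Under these assumptions the gap between the selected score and the fresh-data score grows with the number of configurations tried, even though no configuration is actually better than another. Evaluating the chosen configuration on data not used for selection is one common way to expose this gap.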