External validation of machine learning models - registered models and adaptive sample splitting
Giuseppe Gallitto,Robert Englert,Balint Kincses,Raviteja Kotikalapudi,Jialin Li,Kevin Hoffschlag,Ulrike Bingel,Tamas Spisak
DOI: https://doi.org/10.1101/2023.12.01.569626
2024-05-10
Abstract:Multivariate predictive models play a crucial role in enhancing our understanding of complex biological systems and in developing innovative, replicable tools for translational medical research. However, the complexity of machine learning methods and extensive data pre-processing and feature engineering pipelines can lead to overfitting and poor generalizability. An unbiased evaluation of predictive models necessitates external validation, which involves testing the finalized model on independent data. Despite its importance, external validation is often neglected in practice due to the associated costs. Here we propose that, for maximal credibility, model discovery and external validation should be separated by the public disclosure (e.g. pre-registration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on training and external validation in such studies. We show on data involving more than 3000 participants from four different datasets that, for any “sample size budget”, the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low powered, and thus inconclusive, external validation. The proposed design and splitting approach (implemented in the Python package “AdaptiveSplit”) may contribute to addressing issues of replicability, effect size inflation and generalizability in predictive modeling studies.
Bioinformatics