Are we fitting data or noise? Analysing the predictive power of commonly used datasets in drug-, materials-, and molecular-discovery.

Daniel Crusius, Flaviu Cipcigan, Philip Biggin

DOI: https://doi.org/10.1039/d4fd00091a

2024-06-05

Faraday Discussions

Abstract: Data-driven techniques for establishing quantitative structure-property relations are a pillar of modern materials and molecular discovery. Fuelled by the recent progress in deep learning methodology and the abundance of new algorithms, it is tempting to chase benchmarks and incrementally build ever more capable machine learning (ML) models. While model evaluation has made significant progress, the intrinsic limitations arising from the underlying experimental data are often overlooked. In the chemical sciences data collection is costly, thus datasets are small and experimental errors can be significant. These limitations of such datasets affect their predictive power, a fact that is rarely considered in a quantitative way. In this study, we analyse commonly used ML datasets for regression and classification from drug discovery, molecular discovery, and materials discovery. We derived maximum and realistic performance bounds for nine such datasets by introducing noise based on estimated or actual experimental errors. We then compared the estimated performance bounds to the reported performance of leading ML models in the literature. Out of the nine datasets and corresponding ML models considered, four were identified to have reached or surpassed dataset performance limitations and thus, they may potentially be fitting noise. More generally, we systematically examine how data range, the magnitude of experimental error, and the number of data points influence dataset performance bounds. Alongside this paper, we release the Python package NoiseEstimator and provide a web-based application for computing realistic performance bounds. This study and the resulting tools will help practitioners in the field understand the limitations of datasets and set realistic expectations for ML model performance. This work stands as a reference point, offering analysis and tools to guide development of future ML models in the chemical sciences.

chemistry, physical

What problem does this paper attempt to address?

This paper examines the predictive power of commonly used machine learning (ML) datasets in drug discovery, molecular discovery, and materials discovery. It focuses on how the experimental error inherent in a dataset limits the performance any model can achieve on it. The authors derive maximum and realistic performance bounds for nine datasets by introducing noise based on estimated or actual experimental errors, and compare these bounds with the performance of leading ML models reported in the literature. For four of the nine datasets, the reported models have reached or surpassed the estimated bounds, which suggests that these models may be fitting noise. The paper also systematically investigates how data range, the magnitude of experimental error, and the number of data points affect the performance bounds, and it provides the Python package NoiseEstimator and a web-based application for computing realistic performance bounds. The work aims to help practitioners understand the limitations of their datasets and set realistic expectations for the performance of future ML models.
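The idea of a noise-derived performance bound can be illustrated with a minimal sketch. It is not the NoiseEstimator API; the function name `noise_performance_bound`, its parameters, the Gaussian error model, and the synthetic labels are all assumptions made for illustration. The sketch repeatedly adds noise with a given standard deviation to a set of labels and scores the noisy copy against the original, which approximates the best Pearson r and RMSE that a model could be expected to reach if experimental error were the only source of disagreement.

```python
import numpy as np


def noise_performance_bound(y, sigma, n_repeats=1000, seed=0):
    """Hypothetical helper: score labels against noisy copies of themselves.

    y         : array of (assumed noise-free) labels
    sigma     : assumed standard deviation of Gaussian experimental error
    n_repeats : number of independent noise draws to average over
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    r_values = np.empty(n_repeats)
    rmse_values = np.empty(n_repeats)
    for i in range(n_repeats):
        # Simulate one "re-measurement" of every label.
        y_noisy = y + rng.normal(0.0, sigma, size=y.shape)
        r_values[i] = np.corrcoef(y, y_noisy)[0, 1]
        rmse_values[i] = np.sqrt(np.mean((y - y_noisy) ** 2))
    return {
        "pearson_r_mean": float(r_values.mean()),
        "pearson_r_std": float(r_values.std()),
        "rmse_mean": float(rmse_values.mean()),
        "rmse_std": float(rmse_values.std()),
    }


# Synthetic labels standing in for a real dataset (illustrative values only).
label_rng = np.random.default_rng(42)
wide_labels = label_rng.uniform(4.0, 10.0, size=500)    # wide data range
narrow_labels = label_rng.uniform(6.5, 7.5, size=500)   # narrow data range

wide_bounds = noise_performance_bound(wide_labels, sigma=0.5)
narrow_bounds = noise_performance_bound(narrow_labels, sigma=0.5)

print("wide range:  ", wide_bounds)
print("narrow range:", narrow_bounds)
```

With the same assumed error, the narrow-range labels give a much lower correlation bound than the wide-range labels, while the RMSE bound stays close to the noise level. This mirrors the dependence of the bounds on data range and error magnitude that the paper investigates.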

Understanding the Limitations of Deep Models for Molecular Property Prediction: Insights and Solutions.

Current Methods for Drug Property Prediction in the Real World

Denoising Drug Discovery Data for Improved ADMET Property Prediction

Audacity of huge: overcoming challenges of data scarcity and data quality for machine learning in computational materials discovery

Learning data efficient coarse-grained molecular dynamics from forces and noise

Quantifying the performance of machine learning models in materials discovery

Data-Driven Materials Discovery and Synthesis using Machine Learning Methods

Machine Learning Small Molecule Properties in Drug Discovery

A strategy to apply machine learning to small datasets in materials science

A call for an industry-led initiative to critically assess machine learning for real-world drug discovery

From Prediction to Action: Critical Role of Performance Estimation for Machine-Learning-Driven Materials Discovery

A systematic study of key elements underlying molecular property prediction

A critical examination of robustness and generalizability of machine learning prediction of materials properties

Performance Insights for Small Molecule Drug Discovery Models: Data Scaling, Multitasking, and Generalization

Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data

Unraveling Key Elements Underlying Molecular Property Prediction: A Systematic Study

Distance-based Analysis of Machine Learning Prediction Reliability for Datasets in Materials Science and Other Fields

MD-HIT: Machine learning for materials property prediction with dataset redundancy control

Extrapolative prediction of small-data molecular property using quantum mechanics-assisted machine learning