Abstract: Machine learning (ML) models in the materials sciences that are validated by overly simplistic cross-validation (CV) protocols can yield biased performance estimates for downstream modeling or materials screening tasks. This can be particularly counterproductive for applications where the time and cost of failed validation efforts (experimental synthesis, characterization, and testing) are consequential. We propose a set of standardized and increasingly difficult splitting protocols for chemically and structurally motivated CV that can be followed to validate any ML model for materials discovery. Among several benefits, this enables systematic insights into model generalizability, improvability, and uncertainty, provides benchmarks for fair comparison between competing models with access to differing quantities of data, and systematically reduces possible data leakage through increasingly strict splitting protocols. A general-purpose, model-agnostic toolkit, MatFold, is provided to automate the construction of these CV splits and encourage further community use.

What problem does this paper attempt to address?

The paper addresses how machine learning (ML) models for materials discovery are validated. Its core concern is that overly simplistic cross-validation (CV) protocols yield biased estimates of model performance for downstream modeling or materials screening tasks. This bias matters most when failed validation efforts (experimental synthesis, characterization, and testing) are costly. The paper presents the following key points:

1. **Standardized Cross-Validation Protocols**: The authors propose a set of standardized and progressively more difficult splitting protocols for chemically and structurally motivated cross-validation, which can be used to validate any machine learning model for materials discovery.
2. **MatFold Toolkit**: A general-purpose, model-agnostic toolkit called MatFold automates the construction of these cross-validation splits and is provided to encourage further community use.
3. **Insights into Model Generalization**: The series of protocols gives systematic insight into a model's generalizability, improvability, and uncertainty, provides benchmarks for fair comparison between competing models, and systematically reduces possible data leakage.
4. **Empirical Studies**: The paper demonstrates MatFold through two case studies (prediction of defect formation energy and of the work function of surfaces) and analyzes how model performance changes under different cross-validation strategies.

Overall, the study aims to make the evaluation of generalization in materials-science ML models more reliable and more useful in practice through more rigorous and systematic cross-validation.
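The idea of increasingly strict, chemically motivated splits can be illustrated with a minimal sketch. The code below is not MatFold's API: the function names (`random_kfold`, `leave_one_group_out`, `leave_one_element_out`), the toy formula list, and the regex-based element parser are all assumptions made for illustration. It contrasts a plain random k-fold split with two stricter hold-out schemes, one by chemical system and one by element.

```python
import random
import re
from collections import defaultdict


def elements(formula):
    """Return the set of element symbols in a simple formula such as 'Fe2O3'.

    Assumption: formulas are flat strings without parentheses or charges.
    """
    return set(re.findall(r"[A-Z][a-z]?", formula))


def random_kfold(n_samples, k, seed=0):
    """Baseline: shuffle indices and deal them into k disjoint test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(test)), sorted(test)) for test in folds]


def leave_one_group_out(labels):
    """Hold out all samples that share one label (e.g. a chemical system)."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    splits = []
    for label in sorted(groups):
        held_out = set(groups[label])
        train = [i for i in range(len(labels)) if i not in held_out]
        splits.append((label, train, sorted(held_out)))
    return splits


def leave_one_element_out(formulas):
    """Stricter: hold out every sample that contains a given element."""
    all_elements = sorted(set().union(*(elements(f) for f in formulas)))
    splits = []
    for el in all_elements:
        test = [i for i, f in enumerate(formulas) if el in elements(f)]
        train = [i for i, f in enumerate(formulas) if el not in elements(f)]
        if train and test:  # skip degenerate splits
            splits.append((el, train, test))
    return splits


# Hypothetical toy data set, used only to exercise the functions.
formulas = ["Fe2O3", "FeO", "TiO2", "Ti2O3", "NaCl", "KCl", "FeCl2"]
chemsys = ["-".join(sorted(elements(f))) for f in formulas]

baseline_splits = random_kfold(len(formulas), k=3)
chemsys_splits = leave_one_group_out(chemsys)
element_splits = leave_one_element_out(formulas)
```

In the random baseline, `Fe2O3` may land in the training set while `FeO` is in the test set, so the model has already seen the Fe-O system. The chemical-system split removes that overlap, and the element split goes further by ensuring no training sample contains the held-out element at all. Comparing errors across such a sequence of splits is one way to see how performance degrades as the test set moves further from the training data.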

MatFold: systematic insights into materials discovery models' performance through standardized cross-validation protocols

Machine Learning and Materials Informatics Approaches for Predicting Transverse Mechanical Properties of Unidirectional CFRP Composites with Microvoids

Gaining Confidence on Molecular Classification Through Consensus Modeling and Validation

Matminer: an Open Source Toolkit for Materials Data Mining

Towards Foundation Models for Materials Science: The Open MatSci ML Toolkit

MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling

Matbench Discovery -- A framework to evaluate machine learning crystal stability predictions

Reliable and Explainable Machine Learning Methods for Accelerated Material Discovery

Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties

Step Forward Cross Validation for Bioactivity Prediction: Out of Distribution Validation in Drug Discovery

A critical examination of robustness and generalizability of machine learning prediction of materials properties

External validation of machine learning models - registered models and adaptive sample splitting

Efficient, adaptive cross-validation for tuning and comparing models, with application to drug discovery

Data-Driven Materials Discovery and Synthesis using Machine Learning Methods

NJmat: Data-Driven Machine Learning Interface to Accelerate Material Design

Machine Learning Materials Properties with Accurate Predictions, Uncertainty Estimates, Domain Guidance, and Persistent Online Accessibility

MD-HIT: Machine learning for material property prediction with dataset redundancy control

Interpretable Machine Learning for Materials Design

chemmodlab: A Cheminformatics Modeling Laboratory for Fitting and Assessing Machine Learning Models

Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm

The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science