MatFold: systematic insights into materials discovery models' performance through standardized cross-validation protocols

Matthew Witman, Peter Schindler
DOI: https://doi.org/10.26434/chemrxiv-2024-bmw1n
2024-08-07
Abstract: Machine learning (ML) models in the materials sciences that are validated by overly simplistic cross-validation (CV) protocols can yield biased performance estimates for downstream modeling or materials screening tasks. This can be particularly counterproductive for applications where the time and cost of failed validation efforts (experimental synthesis, characterization, and testing) are consequential. We propose a set of standardized and increasingly difficult splitting protocols for chemically and structurally motivated CV that can be followed to validate any ML model for materials discovery. Among several benefits, this enables systematic insights into model generalizability, improvability, and uncertainty, provides benchmarks for fair comparison between competing models with access to differing quantities of data, and systematically reduces possible data leakage through increasingly strict splitting protocols. A general-purpose, model-agnostic toolkit, MatFold, is provided to automate the construction of these CV splits and encourage further community use.
Chemistry
What problem does this paper attempt to address?
The paper addresses how machine learning (ML) models for materials discovery are validated. Its core concern is that overly simplistic cross-validation (CV) protocols yield biased performance estimates for downstream modeling and materials screening tasks. This bias matters most when failed validation efforts (experimental synthesis, characterization, and testing) are costly.

The paper presents the following key points:

1. **Standardized Cross-Validation Protocols**: The authors propose a set of standardized, increasingly difficult, chemically and structurally motivated splitting protocols that can be used to validate any ML model for materials discovery.
2. **MatFold Toolkit**: A general-purpose, model-agnostic toolkit, MatFold, automates the construction of these CV splits and is intended to encourage further community use.
3. **Insights into Model Generalization**: Applying the protocols in sequence gives systematic insight into a model's generalizability, improvability, and uncertainty. It also provides benchmarks for fair comparison between competing models and systematically reduces possible data leakage.
4. **Empirical Studies**: Two case studies (prediction of defect formation energies and of surface work functions) demonstrate the toolkit and analyze how model performance depends on the CV strategy.

Overall, the study aims to make evaluation of ML model generalization in materials science more reliable and more useful in practice through more rigorous and systematic cross-validation.
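The idea of increasingly strict, chemically motivated splits can be illustrated with a minimal sketch. This is not MatFold's API: the function names (`elements_in`, `chemical_system`, `leave_one_group_out`, `leave_one_element_out`), the toy formulas, and the simple regular-expression formula parser are all assumptions made for illustration. The sketch builds two hold-out schemes: one that holds out a whole chemical system at a time, and a stricter one that holds out every entry containing a given element.

```python
import re
from collections import defaultdict


def elements_in(formula):
    """Element symbols in a simple formula string (no parentheses or charges)."""
    return sorted(set(re.findall(r"[A-Z][a-z]?", formula)))


def chemical_system(formula):
    """Label such as 'O-Ti' that identifies the set of elements in a formula."""
    return "-".join(elements_in(formula))


def leave_one_group_out(labels):
    """Hold out one label at a time; returns {label: (train_idx, test_idx)}."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    splits = {}
    for label, test_idx in groups.items():
        train_idx = [i for i in range(len(labels)) if labels[i] != label]
        if train_idx:  # skip splits that would leave nothing to train on
            splits[label] = (train_idx, test_idx)
    return splits


def leave_one_element_out(formulas):
    """Hold out every entry containing a given element (stricter than above)."""
    element_sets = [set(elements_in(f)) for f in formulas]
    splits = {}
    for element in sorted(set().union(*element_sets)):
        test_idx = [i for i, s in enumerate(element_sets) if element in s]
        train_idx = [i for i, s in enumerate(element_sets) if element not in s]
        if train_idx:  # e.g. oxygen appears everywhere here, so it is skipped
            splits[element] = (train_idx, test_idx)
    return splits


# Hypothetical toy dataset of oxide formulas.
formulas = ["TiO2", "Ti2O3", "SrTiO3", "MgO", "MgAl2O4", "Al2O3"]

chemsys_splits = leave_one_group_out([chemical_system(f) for f in formulas])
element_splits = leave_one_element_out(formulas)
```

In this toy set, holding out the `O-Ti` chemical system still leaves `SrTiO3` in the training data, so the model has seen titanium in some form. Holding out the element `Ti` removes that entry as well, which is the sense in which a stricter protocol reduces possible data leakage.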