Data-Error Scaling in Machine Learning on Natural Discrete Combinatorial Mutation-prone Sets: Case Studies on Peptides and Small Molecules

Vanni Doffini, O. Anatole von Lilienfeld, Michael A. Nash
2024-05-09
Abstract: We investigate trends in the data-error scaling behavior of machine learning (ML) models trained on discrete combinatorial spaces that are prone to mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computationally generated training data. Our synthetic datasets comprise i) two naïve functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two 6-heavy-atom structural graphs. In contrast to typical data-error scaling, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes, which we call saturated and asymptotic decay, and found that they are conditioned by the level of complexity (i.e. number of mutations) enclosed in the training set. We show that during training on this class of problems, the predictions were clustered by the ML models employed in the calibration plots. Furthermore, we present an alternative strategy to normalize learning curves (LCs) and the concept of mutant-based shuffling. This work has implications for machine learning on mutagenisable discrete spaces such as chemical properties or protein phenotype prediction, and improves basic understanding of concepts in statistical learning theory.
Chemical Physics, Machine Learning
What problem does this paper attempt to address?
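This paper examines data-error scaling, that is, how the test error falls as the training set grows, for machine learning (ML) models trained on mutation-prone discrete combinatorial spaces such as proteins or small organic molecules. The authors train and evaluate kernel ridge regression machines on varying amounts of computationally generated training data. The synthetic datasets comprise two naïve functions based on many-body theory, binding energy estimates between a protein and a mutagenised peptide, and solvation energies of two 6-heavy-atom structural graphs. In contrast to typical data-error scaling, the learning curves show discontinuous phase transitions: the test error drops rapidly at particular training-set sizes. The authors identify two learning regimes, saturated and asymptotic decay, which depend on the level of complexity (the number of mutations) contained in the training set. They also report that the models' predictions appear clustered in calibration plots, propose an alternative way to normalize learning curves, and introduce the concept of mutant-based shuffling.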
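
To make the notion of a data-error learning curve concrete, the following is a minimal sketch of kernel ridge regression trained on increasing amounts of data from a small discrete sequence space. It is not the authors' code or data: the toy energy function (random one-body and two-body terms), the one-hot encoding, the Gaussian kernel, and all names and hyperparameters (`toy_energy`, `sigma`, `lam`, the sequence length and alphabet size) are assumptions chosen only for illustration.

```python
import itertools
import numpy as np

L, A = 5, 4  # assumed sequence length and alphabet size (4**5 = 1024 sequences)
rng = np.random.default_rng(0)
w1 = rng.normal(size=(L, A))        # hypothetical one-body weights
w2 = rng.normal(size=(L, L, A, A))  # hypothetical two-body weights

def toy_energy(seqs):
    """Toy target: sum of one-body and pairwise terms (illustrative only)."""
    e = w1[np.arange(L), seqs].sum(axis=1)
    for i, j in itertools.combinations(range(L), 2):
        e = e + w2[i, j, seqs[:, i], seqs[:, j]]
    return e

def one_hot(seqs, n_letters):
    """Encode an (n, L) integer array as an (n, L * n_letters) binary matrix."""
    n, length = seqs.shape
    out = np.zeros((n, length * n_letters))
    rows = np.repeat(np.arange(n), length)
    cols = (np.arange(length) * n_letters + seqs).ravel()
    out[rows, cols] = 1.0
    return out

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel matrix between the rows of X1 and the rows of X2."""
    d2 = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def krr_fit_predict(X_train, y_train, X_test, sigma=2.0, lam=1e-6):
    """Solve (K + lam * I) alpha = y, then predict on the test points."""
    K = gaussian_kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

def learning_curve(X, y, train_sizes, n_test=200, n_repeats=5, seed=0):
    """Mean absolute test error for each training-set size, averaged over random splits."""
    split_rng = np.random.default_rng(seed)
    curve = []
    for n in train_sizes:
        errs = []
        for _ in range(n_repeats):
            perm = split_rng.permutation(len(X))
            test, train = perm[:n_test], perm[n_test:n_test + n]
            pred = krr_fit_predict(X[train], y[train], X[test])
            errs.append(np.mean(np.abs(pred - y[test])))
        curve.append(float(np.mean(errs)))
    return curve

# Enumerate the full sequence space, then measure error versus training-set size.
seqs = np.array(list(itertools.product(range(A), repeat=L)))
X, y = one_hot(seqs, A), toy_energy(seqs)
train_sizes = [8, 32, 128, 512]
curve = learning_curve(X, y, train_sizes)
for n, err in zip(train_sizes, curve):
    print(f"N = {n:4d}  MAE = {err:.3f}")
```

Plotting `curve` against `train_sizes` on log-log axes gives the learning curve. This sketch draws training points uniformly at random, so it should not be expected to reproduce the phase transitions reported in the abstract; there, the regimes are tied to how many mutations the training set covers.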