Small molecule machine learning: All models are wrong, some may not even be useful

Fleming Kretschmer, Jan Seipp, Marcus Ludwig, Gunnar W. Klau, Sebastian Böcker
DOI: https://doi.org/10.1101/2023.03.27.534311
2024-03-21
Abstract: Small molecule machine learning tries to predict chemical, biochemical or biological properties from the structure of a molecule. Applications include prediction of toxicity, ligand binding or retention time. A recent trend is to develop end-to-end models that avoid the explicit integration of domain knowledge via inductive bias. A central assumption in doing so is that there is no coverage bias in the training and evaluation data, meaning that these data are a representative subset of the true distribution we want to learn. Usually, the domain of applicability is neither considered nor analyzed for such large-scale end-to-end models. Here, we investigate how well certain large-scale datasets from the field cover the space of all known biomolecular structures. Investigation of coverage requires a sensible distance measure between molecular structures. We use a well-known distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which agrees well with the chemical intuition of similarity between compounds. Unfortunately, this computational problem is provably hard, severely restricting the use of the corresponding distance measure in large-scale studies. We introduce an exact approach that combines Integer Linear Programming and intricate heuristic bounds to ensure efficient computations and dependable results. We find that several large-scale datasets frequently used in this domain of machine learning are far from a uniform coverage of known biomolecular structures. This severely confines the predictive power of models trained on these data. Next, we propose two further approaches to check if a training dataset differs substantially from the distribution of known biomolecular structures. On the positive side, our methods may allow creators of large-scale datasets to identify regions in molecular structure space where it is advisable to provide additional training data.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the problem of dataset representativeness in small molecule machine learning. The current trend is to develop end-to-end models that avoid explicit integration of domain knowledge. Such models implicitly assume that the training and evaluation data have no coverage bias, that is, that they are a representative subset of the true distribution to be learned. For large-scale end-to-end models, however, the domain of applicability is usually neither considered nor analyzed. The authors study how well several commonly used large-scale datasets cover the space of known biomolecular structures, using a well-known distance measure based on the Maximum Common Edge Subgraph (MCES) problem. Because computing the MCES is NP-hard, they introduce an exact method that combines Integer Linear Programming with heuristic bounds to achieve efficient computation and dependable results. The study finds that several commonly used datasets are far from uniform in their coverage of known biomolecular structures, which limits the predictive power of models trained on these data. In addition, the authors propose two further approaches to check whether a training dataset differs substantially from the distribution of known biomolecular structures. These methods may help dataset creators identify regions of molecular structure space where additional training data are advisable. The paper also discusses the advantages and disadvantages of molecular fingerprints and the MCES distance for estimating molecular structure similarity, and presents the distribution patterns of different datasets, revealing structural biases in some of them. Finally, the authors emphasize that the distance measure must be chosen carefully, so that models are not applied to problems outside their domain of applicability.
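To make the idea of an MCES-based distance concrete, the following is a minimal Python sketch on toy molecular graphs. It is not the authors' method: it uses brute-force enumeration of atom mappings instead of Integer Linear Programming with bounds, so it is only feasible for a handful of atoms. It also assumes that the distance is the number of bonds in both molecules that are not part of the common subgraph, and it ignores hydrogens and bond orders. The names `make_molecule`, `mces_size`, `mces_distance` and `nearest_distance` are hypothetical helpers introduced for illustration.

```python
from itertools import permutations

# A toy molecule is (atom_labels, bonds): a list of element symbols and a set
# of frozenset({i, j}) atom-index pairs. Assumption: no hydrogens, no bond
# orders, no self-loops.


def make_molecule(labels, bonds):
    """Build the toy molecule representation from labels and index pairs."""
    return list(labels), {frozenset(b) for b in bonds}


def mces_size(mol_a, mol_b):
    """Number of bonds in a maximum common edge subgraph (brute force).

    Every injective mapping of the smaller molecule's atoms into the larger
    one is tried. A bond counts as shared only if both endpoints keep their
    element label and the mapped pair is also a bond in the other molecule.
    """
    if len(mol_a[0]) > len(mol_b[0]):
        mol_a, mol_b = mol_b, mol_a  # map the smaller graph into the larger
    labels_a, bonds_a = mol_a
    labels_b, bonds_b = mol_b
    best = 0
    for image in permutations(range(len(labels_b)), len(labels_a)):
        shared = 0
        for bond in bonds_a:
            u, v = tuple(bond)
            if (labels_a[u] == labels_b[image[u]]
                    and labels_a[v] == labels_b[image[v]]
                    and frozenset((image[u], image[v])) in bonds_b):
                shared += 1
        best = max(best, shared)
    return best


def mces_distance(mol_a, mol_b):
    """Assumed distance: bonds of both molecules outside the common subgraph."""
    return len(mol_a[1]) + len(mol_b[1]) - 2 * mces_size(mol_a, mol_b)


def nearest_distance(query, training_set):
    """Distance from a query molecule to its closest training molecule.

    A large value hints that the query lies in a poorly covered region.
    """
    return min(mces_distance(query, mol) for mol in training_set)


# Heavy-atom skeletons of three small molecules.
ETHANOL = make_molecule(["C", "C", "O"], [(0, 1), (1, 2)])
PROPANOL = make_molecule(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
DIMETHYL_ETHER = make_molecule(["C", "O", "C"], [(0, 1), (1, 2)])

if __name__ == "__main__":
    print(mces_distance(ETHANOL, PROPANOL))
    print(mces_distance(ETHANOL, DIMETHYL_ETHER))
    print(nearest_distance(PROPANOL, [ETHANOL, DIMETHYL_ETHER]))
```

The enumeration grows factorially with the number of atoms, which illustrates why an exact approach with bounds is needed for large-scale studies. The `nearest_distance` helper is only one simple way to probe coverage and is not claimed to be either of the two approaches proposed in the paper.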