Machine-learning models for combinatorial catalyst discovery

Gregory A Landrum, Julie E Penzotti, Santosh Putta
DOI: https://doi.org/10.1088/0957-0233/16/1/035
IF: 2.398
2004-12-18
Measurement Science and Technology
Abstract: A variety of machine learning algorithms, including hierarchical clustering, decision trees, k-nearest neighbours, support vector machines and bagging, were applied to construct models to predict the molecular weight of the polymers produced by a set of 96 homogeneous catalysts. The goal of the study was to develop models that could be used to screen large virtual libraries of catalysts in order to suggest candidates for further synthesis and screening. The descriptors used to represent the catalysts did not require detailed information about the catalysts themselves; they could be calculated using only the topology of the ligands. Using an initial set of five descriptors, model accuracies of about 70% were observed from each learning algorithm. A larger descriptor set (with ten descriptors) allowed bag classifiers that were 80% accurate to be built. All models were carefully evaluated to detect overfitting (memorization of the training data) and one example of the effects of overfitting is provided. Because the descriptors used in this study can be calculated very rapidly and the models themselves are very efficient, these bag classifiers are well suited to screening very large virtual libraries.
Categories: engineering, multidisciplinary; instruments & instrumentation
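
The bagging and overfitting check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic (96 hypothetical catalysts with five random descriptors), the binary high/low molecular-weight label is an assumption, and single-split decision stumps stand in for the full decision trees used in the paper. All function names are hypothetical. Comparing training accuracy with held-out accuracy is one simple way to spot memorization of the training data.

```python
import random
from collections import Counter

def make_synthetic_catalysts(n, n_desc, rng):
    """Hypothetical data: random descriptors and a binary high/low label."""
    X = [[rng.random() for _ in range(n_desc)] for _ in range(n)]
    y = []
    for row in X:
        # Assumed rule: the label depends on the first two descriptors.
        label = 1 if row[0] + row[1] > 1.0 else 0
        if rng.random() < 0.1:  # 10% label noise, also an assumption
            label = 1 - label
        y.append(label)
    return X, y

def fit_stump(X, y):
    """Pick the one-descriptor threshold split with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (0, 1):
                errors = sum(
                    (polarity if row[j] > t else 1 - polarity) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]  # (descriptor index, threshold, polarity)

def predict_stump(stump, row):
    j, t, polarity = stump
    return polarity if row[j] > t else 1 - polarity

def fit_bag(X, y, n_models, rng):
    """Bagging: train each stump on a bootstrap resample of the training set."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bag(models, row):
    """Majority vote over the bagged stumps."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]

def accuracy(models, X, y):
    hits = sum(predict_bag(models, row) == label for row, label in zip(X, y))
    return hits / len(y)

rng = random.Random(0)
X, y = make_synthetic_catalysts(96, 5, rng)

# Hold out a quarter of the catalysts to estimate performance on unseen data.
X_train, y_train = X[:72], y[:72]
X_test, y_test = X[72:], y[72:]

bag = fit_bag(X_train, y_train, n_models=25, rng=rng)  # odd count avoids tied votes
train_acc = accuracy(bag, X_train, y_train)
test_acc = accuracy(bag, X_test, y_test)

# A training accuracy far above the held-out accuracy suggests overfitting.
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

Because prediction only requires evaluating a handful of threshold tests per catalyst, a model of this kind is cheap to apply to a large virtual library once the descriptors have been computed.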