Leveraging our Teacher’s Experience to Improve Machine Learning: Application to pKa Prediction

Nicolas Moitessier, Jérôme Genzling, Ziling Luo, Benjamin Weiser
DOI: https://doi.org/10.26434/chemrxiv-2024-bpd53
2024-01-05
Abstract: Machine learning (ML) is gaining momentum in chemistry for the prediction of various molecular properties, or for the generation of novel molecules with specific properties. However, these models can only be trained on relatively scarce, often low-quality data. Thus, memorization (rather than learning) may result in poorly generalizable models. To address this issue, we aimed to revisit the way ML is practiced in chemistry. Using pKa prediction as an example, we present our strategy which involves imparting Chemistry knowledge to ML algorithms. We posit that teaching fundamental principles (e.g., electronegativity and inductive effect) to machines to predict properties (e.g., pKa), analogous to the way we teach students, will allow them to predict more advanced, yet related, properties. Thus, ML will leverage the chemists’ knowledge and qualitative principles to quantify and predict chemical properties.
Chemistry
What problem does this paper attempt to address?
This paper addresses a problem that machine learning (ML) faces when predicting molecular properties in chemistry: training data are scarce and often of low quality, so a model may memorize its training set rather than learn, and therefore generalize poorly. Using pKa prediction as an example, the authors propose teaching chemical knowledge to ML algorithms in the way a teacher instructs students. They argue that a model taught basic principles, such as electronegativity and the inductive effect, will be better able to predict related but more advanced properties.
The paper first points out problems with the current application of AI in chemistry: the lack of a close connection between experiment and computer science, the tendency of models to overfit, and the neglect of physical realism. The authors then take pKa prediction as a case study and teach the model principles covered in organic chemistry courses, such as resonance and inductive effects. They use a graph neural network (GNN) whose node and edge feature matrices are designed to encode the atom and bond attributes that influence pKa. The resulting model accounts for long-range atomic influences and is trained and tested on a large, carefully selected and validated dataset, so that the evaluation reflects understanding rather than memorized answers. According to the results, this knowledge-based teaching strategy improves the model's accuracy, and the model is more accurate than existing fingerprint-based or graph-based methods.
In summary, the paper proposes a new approach for designing and evaluating machine learning models, emphasizing the selection of descriptors based on chemical principles to enhance the model's generalization ability and interpretability. This is of significant importance for predicting molecular properties in the field of chemistry.
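To make the idea of chemistry-informed node and edge features more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the feature choices (Pauling electronegativity, heavy-atom degree, bond order), the function names, and the hop-decayed `inductive_pull` score with its `decay` parameter are all assumptions made for illustration. Molecules are written by hand as heavy-atom graphs instead of being parsed by a cheminformatics toolkit.

```python
from collections import deque

# Pauling electronegativities for a few common elements.
ELECTRONEGATIVITY = {"H": 2.20, "C": 2.55, "N": 3.04, "O": 3.44, "F": 3.98, "Cl": 3.16}

# Hand-written heavy-atom graphs: (element symbols, bonds as (i, j, bond order)).
ACETIC = (["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
CHLOROACETIC = (["Cl", "C", "C", "O", "O"], [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)])


def node_features(atoms, bonds):
    """One row per atom: [electronegativity, number of heavy-atom neighbours]."""
    degree = [0] * len(atoms)
    for i, j, _order in bonds:
        degree[i] += 1
        degree[j] += 1
    return [[ELECTRONEGATIVITY[sym], float(degree[k])] for k, sym in enumerate(atoms)]


def edge_features(bonds):
    """One entry per directed edge: [bond order, 1.0 if multiple bond else 0.0]."""
    feats = {}
    for i, j, order in bonds:
        row = [float(order), 1.0 if order > 1 else 0.0]
        feats[(i, j)] = row
        feats[(j, i)] = row
    return feats


def hop_distances(n_atoms, bonds, source):
    """Breadth-first search: number of bonds between `source` and every atom."""
    neighbours = [[] for _ in range(n_atoms)]
    for i, j, _order in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        a = queue.popleft()
        for b in neighbours[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist


def inductive_pull(atoms, bonds, site, decay=0.5):
    """Hypothetical long-range descriptor: electronegativity excess over carbon,
    summed over all other atoms and damped by `decay` per extra bond of separation."""
    ref = ELECTRONEGATIVITY["C"]
    dist = hop_distances(len(atoms), bonds, site)
    return sum(
        (ELECTRONEGATIVITY[atoms[k]] - ref) * decay ** (d - 1)
        for k, d in dist.items()
        if k != site
    )


if __name__ == "__main__":
    for name, (atoms, bonds), site in [
        ("acetic acid", ACETIC, 3),
        ("chloroacetic acid", CHLOROACETIC, 4),
    ]:
        print(name, node_features(atoms, bonds), round(inductive_pull(atoms, bonds, site), 4))
```

In this toy descriptor the hydroxyl oxygen of chloroacetic acid receives a larger score than that of acetic acid, which agrees qualitatively with chloroacetic acid being the stronger acid. A real model would feed richer node and edge features of this kind into a GNN and learn how far such effects propagate, instead of fixing a decay factor by hand.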