PythiaCHEM: a user-friendly machine learning toolkit for chemistry

Fernanda Duarte, Zonghua Bo, Stamatia Zavitsanou, Emanuele Casali, Matthew Langton
DOI: https://doi.org/10.26434/chemrxiv-2024-fqdtn
2024-01-12
Abstract: Machine learning (ML) is currently transforming the field of chemistry by offering unparalleled efficiency in addressing complex challenges. Despite the progress made, a notable gap persists in the availability of user-friendly tools tailored to chemical problems involving small and sparse datasets. Here, we introduce PythiaCHEM, an ML toolkit designed to develop data-driven predictive ML models. It enables the use of various descriptors and ML frameworks for regression and classification tasks in an automated, flexible, and accessible manner through Jupyter Notebooks, making it easy to customize for specific tasks. We showcase the capabilities and versatility of PythiaCHEM in two distinct chemistry tasks: first, the evaluation of the transmembrane chloride anion transport activity of synthetic anion transporters, and second, the prediction of enantioselectivity in the Strecker synthesis of α-amino acids. Our results highlight the utility of PythiaCHEM as a powerful and user-friendly framework for developing predictive ML models applicable in different domains of chemistry.
Chemistry
What problem does this paper attempt to address?
This paper introduces PythiaCHEM, a machine learning toolkit for building predictive models on the small and sparse datasets that are common in chemistry. Machine learning already plays an important role in the field, but user-friendly tools aimed at such datasets remain scarce. PythiaCHEM is delivered as a set of Jupyter Notebooks that automate regression and classification workflows while leaving each step open to customization for a specific task. The paper demonstrates the toolkit on two chemical problems: evaluating the transmembrane chloride anion transport activity of synthetic anion transporters, and predicting enantioselectivity in the Strecker synthesis of α-amino acids. These case studies are presented as evidence that the same framework can be applied across different domains of chemistry.

PythiaCHEM comprises six Python modules built on open-source libraries such as RDKit and scikit-learn, and it provides a range of machine learning algorithms, including LASSO, logistic regression, and support vector machines. Users can choose an algorithm according to dataset size, feature complexity, and the level of model interpretability they need. The toolkit also covers data preprocessing, feature (descriptor) generation, and feature selection, so that it can be adapted to different chemical problems. The intended benefit is that researchers can build data-driven predictive models when data are limited, rather than depending on large-scale datasets.
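To make the small-dataset setting concrete, here is a minimal sketch of LASSO regression evaluated with leave-one-out cross-validation, a common choice when only a handful of samples are available. It is not PythiaCHEM's code or API: the function names (`soft_threshold`, `fit_lasso`, `loo_mae`), the coordinate-descent solver, the use of leave-one-out, and the toy descriptor table are all assumptions made for illustration, and the example is written in plain Python so that it runs without any dependencies. In practice one would more likely reach for scikit-learn's `Lasso` and `LeaveOneOut`.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def fit_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * sum((y - b0 - X.w)^2) + lam * sum(|w|).

    X is a list of rows (lists of floats), y a list of floats.
    Returns (intercept, weights). The intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = b0 + sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
        # Refit the intercept as the mean residual given the current weights.
        b0 = sum(y[i] - sum(w[k] * X[i][k] for k in range(p)) for i in range(n)) / n
    return b0, w


def loo_mae(X, y, lam):
    """Leave-one-out mean absolute error: refit once per held-out sample."""
    errors = []
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        b0, w = fit_lasso(X_train, y_train, lam)
        pred = b0 + sum(wk * xk for wk, xk in zip(w, X[i]))
        errors.append(abs(pred - y[i]))
    return sum(errors) / len(errors)


# Toy data, invented for illustration: two made-up descriptor columns and a
# target that depends only on the first one (y = 1 + 2 * x1).
X = [[-1.0, 1.0], [-1.0, -1.0], [0.0, 1.0], [0.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
y = [-1.0, -1.0, 1.0, 1.0, 3.0, 3.0]

for lam in (0.01, 0.1, 5.0):
    b0, w = fit_lasso(X, y, lam)
    print(f"lam={lam}: intercept={b0:.3f}, "
          f"weights={[round(v, 3) for v in w]}, LOO MAE={loo_mae(X, y, lam):.3f}")
```

The L1 penalty sets the weight of the uninformative second descriptor to zero, which is the kind of built-in feature selection and interpretability that makes LASSO attractive for sparse chemical datasets. A penalty that is too large zeroes every weight, so the model falls back to predicting the training mean and the leave-one-out error rises.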