QSARtuna: an automated QSAR modelling platform for molecular property prediction in drug design

Lewis Mervin, Alexey Voronov, Mikhail Kabeshov, Ola Engkvist
DOI: https://doi.org/10.26434/chemrxiv-2024-2rlk7-v2
2024-03-27
Abstract: Machine-learning (ML) and Deep-Learning (DL) approaches to predict the molecular properties of small molecules are increasingly deployed within the design-make-test-analyse (DMTA) drug design cycle to predict molecular properties of interest. Despite this uptake, there are only a few automated packages to aid their development and deployment that also support uncertainty estimation, model explainability and other key aspects of model usage. This represents a key unmet need within the field, and the large number of molecular representations and algorithms (and associated parameters) means it is non-trivial to robustly optimise, evaluate, reproduce, and deploy models. Here we present QSARtuna, a molecule property prediction modelling pipeline, written in Python and utilising the Optuna, Scikit-learn, RDKit and ChemProp packages, which enables the efficient and automated comparison between molecular representations and machine learning models. The platform was developed considering the increasingly important aspect of model uncertainty quantification and explainability by design. We provide details of our framework and illustrative examples to demonstrate the capability of the software when applied to simple molecular property, reaction/reactivity prediction and DNA encoded library enrichment analyses. We hope that the release of QSARtuna will further spur innovation in automatic ML modelling and provide a platform for education of best practices in molecular property modelling. The code to the Qptuna framework is made freely available via GitHub.
Chemistry
What problem does this paper attempt to address?
This paper addresses the shortage of automated Quantitative Structure-Activity Relationship (QSAR) modeling platforms for drug design. Although ML and DL methods are widely used to predict molecular properties, few automated tools also support key features such as uncertainty estimation and model explainability. QSARtuna is an open-source pipeline written in Python that uses Optuna, Scikit-learn, RDKit, and ChemProp to compare molecular representations and machine learning models efficiently and automatically. The platform was built with uncertainty quantification and explainability in mind from the outset, and the paper illustrates it on simple molecular property prediction, reaction/reactivity prediction, and DNA-encoded library enrichment analysis. The authors hope QSARtuna will spur innovation in automated ML modeling and serve as an educational platform for best practices in molecular property modeling. The code is freely available on GitHub.