Scikit-Mol brings cheminformatics to Scikit-Learn

Esben Jannik Bjerrum, Rafał Adam Bachorz, Adrien Bitton, Oh-hyeon Choung, Ya Chen, Carmen Esposito, Son Viet Ha, Andreas Poehlmann
DOI: https://doi.org/10.26434/chemrxiv-2023-fzqwd
2023-12-06
Abstract: Scikit-Mol is an open-source toolkit that aims to bridge the gap between two well-established toolkits, RDKit and Scikit-Learn, in order to provide a simple interface for building cheminformatics models. By leveraging the strengths of both RDKit and Scikit-Learn, Scikit-Mol provides a powerful platform for building predictive models in drug discovery and materials design. Unlike other toolkits, which often integrate both chemistry and machine learning, Scikit-Mol aims to be a simple bridge between the two, reducing the maintenance effort required to keep up with changes and new features in, for example, Scikit-Learn. A simple example of Scikit-Mol's functionality is provided, demonstrating its compatibility with Scikit-Learn pipelines. Overall, Scikit-Mol provides a useful and flexible package for building self-contained and self-documented cheminformatics models with minimal maintenance required.
Chemistry
What problem does this paper attempt to address?
This paper introduces Scikit-Mol, an open-source toolkit that bridges cheminformatics (RDKit) and machine learning (Scikit-Learn) and provides a simple interface for building cheminformatics models. It draws on the strengths of both libraries to support predictive modeling in drug discovery and materials design. Rather than integrating chemistry and machine learning into a single framework, Scikit-Mol acts as a thin bridge between the two, which reduces the work needed to keep up with changes and new features in projects such as Scikit-Learn. The paper notes that other toolkits have attempted to combine chemistry and machine learning, but that they often require significant maintenance. Scikit-Mol instead limits its own maintenance burden by connecting two mature, widely used open-source projects.

Scikit-Mol supports the creation of simple, self-contained, and self-documented models. Because it is compatible with Scikit-Learn pipelines, the resulting models can be serialized, saved, and reloaded. The paper gives a simple example showing how Scikit-Mol integrates with Scikit-Learn to handle molecular datasets, in particular with SMILES strings as input.

The paper also discusses implementation details, including the different types of fingerprint and descriptor transformers and the RDKit-based standardization and transformation functionality. It describes support for parallel computation and reports speed improvements on multi-core processors. Finally, it highlights the documentation and availability of Scikit-Mol, including example notebooks and detailed documentation intended to ease its use in scientific research.