Abstract: Scikit-Mol is an open-source toolkit that aims to bridge the gap between two well-established toolkits, RDKit and Scikit-Learn, in order to provide a simple interface for building cheminformatics models. By leveraging the strengths of both RDKit and Scikit-Learn, Scikit-Mol provides a powerful platform for building predictive models in drug discovery and materials design. Unlike other toolkits, which often integrate chemistry and machine learning in a single framework, Scikit-Mol aims to be a simple bridge between the two, reducing the maintenance effort required to keep up with changes and new features in, e.g., Scikit-Learn. A simple example of Scikit-Mol's functionality is provided, demonstrating its compatibility with Scikit-Learn pipelines. Overall, Scikit-Mol provides a useful and flexible package for building self-contained and self-documented cheminformatics models with minimal maintenance required.

What problem does this paper attempt to address?

This paper introduces Scikit-Mol, an open-source toolkit that bridges cheminformatics (RDKit) and machine learning (Scikit-Learn) through a simple interface for building cheminformatics models. It draws on the strengths of both libraries to support predictive modeling in drug discovery and materials design.

The problem the paper addresses is maintenance. Other toolkits that combine chemistry and machine learning in a single framework often need significant work to keep pace with upstream changes. Scikit-Mol instead acts as a thin bridge between two mature, widely used open-source projects, so new features in projects like Scikit-Learn become available with little extra code to maintain. The resulting models are simple, self-contained, and self-documented, and because they are compatible with Scikit-Learn pipelines they can be serialized, saved, and loaded easily.

The paper gives a simple example of how Scikit-Mol integrates with Scikit-Learn to handle molecular datasets, in particular with SMILES strings as input. It also covers implementation details, including the different fingerprint and descriptor transformers and the RDKit-based standardization and transformation functionality, and it reports speed improvements from parallel computation on multi-core processors. Finally, it describes the availability of Scikit-Mol and its documentation, which includes example notebooks and detailed reference material to support use in scientific research.
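The bridge pattern described above can be illustrated with a minimal sketch. It does not use Scikit-Mol or RDKit: `ToySmilesFingerprinter` is a hypothetical stand-in that hashes character n-grams of a SMILES string into a bit vector, and the target values are made up. It is meant only to show how a featurizer that follows the Scikit-Learn transformer interface lets SMILES strings go straight into a pipeline.

```python
import zlib

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


class ToySmilesFingerprinter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an RDKit-backed fingerprint transformer.

    It hashes overlapping character n-grams of each SMILES string into a
    fixed-length bit vector. This is NOT a chemical fingerprint; it only
    mimics the fit/transform interface such a transformer would expose.
    """

    def __init__(self, n_bits=256, ngram=3):
        self.n_bits = n_bits
        self.ngram = ngram

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.n_bits), dtype=np.uint8)
        for row, smiles in enumerate(X):
            # Strings shorter than ngram still contribute one chunk.
            for i in range(max(len(smiles) - self.ngram + 1, 1)):
                chunk = smiles[i : i + self.ngram]
                # crc32 is deterministic across processes, unlike hash().
                out[row, zlib.crc32(chunk.encode()) % self.n_bits] = 1
        return out


# Illustrative data only: SMILES inputs with made-up target values.
smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCCCO", "c1ccccc1O"]
y = np.array([0.1, 0.3, 2.1, -0.2, 0.9, 1.5])

# Featurization and regression live in one estimator object.
model = Pipeline([
    ("fingerprint", ToySmilesFingerprinter(n_bits=256)),
    ("regressor", Ridge(alpha=1.0)),
])
model.fit(smiles, y)
predictions = model.predict(["CCO", "c1ccccc1"])

# Featurizer settings are ordinary pipeline hyperparameters.
params = model.get_params()
```

Because the featurizer's settings appear as pipeline parameters (here `fingerprint__n_bits`), they can be tuned with the usual Scikit-Learn tools, and the fitted pipeline can be saved and reloaded like any other Scikit-Learn estimator as long as the transformer class is importable when loading.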

Scikit-Mol brings cheminformatics to Scikit-Learn

MolPipeline: A python package for processing molecules with RDKit in scikit-learn

molli: A General-Purpose Python Toolkit for Combinatorial Small Molecule Library Generation, Manipulation, and Feature Extraction.

ChemSuite: A package for chemoinformatics calculations and machine learning

BuildAMol: A versatile Python toolkit for fragment-based molecular design

DeepMol: An Automated Machine and Deep Learning Framework for Computational Chemistry

MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design

MolScore: A scoring and evaluation framework for de novo drug design

chemmodlab: A Cheminformatics Modeling Laboratory for Fitting and Assessing Machine Learning Models

MolData, a molecular benchmark for disease and target based machine learning

MolCompass: multi-tool for the navigation in chemical space and visual validation of QSAR/QSPR models

Metis - A Python-Based User Interface to Collect Expert Feedback for Generative Chemistry Models

MolFeSCue: enhancing molecular property prediction in data-limited and imbalanced contexts using few-shot and contrastive learning

MLatom 3: Platform for machine learning-enhanced computational chemistry simulations and workflows

MiniMol: A Parameter-Efficient Foundation Model for Molecular Learning

MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts

ChatMol: Interactive Molecular Discovery with Natural Language

MolOptimizer: A Molecular Optimization Toolkit for Fragment-Based Drug Design

MolSnapper: Conditioning Diffusion for Structure Based Drug Design

FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning
