MolPipeline: A python package for processing molecules with RDKit in scikit-learn

Christian Wolfgang Feldmann, Jochen Sieg, Miriam Mathea, Jennifer Hemmerich, Conrad Stork, Frederik Sandfort, Philipp Eiden
DOI: https://doi.org/10.26434/chemrxiv-2024-kd11b
2024-04-19
Abstract: The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to the machine learning model. We introduce the MolPipeline package, which extends this concept to chemoinformatics by wrapping default functionalities of RDKit, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package to create completely automated end-to-end pipelines that scale to large data sets. Particular emphasis was put on handling erroneous instances, where resolution would require manual intervention in default pipelines. In addition, we included common cheminformatics tasks, like scaffold splits and molecular standardization, natively in the pipeline framework and adaptable for the needs of various projects.
Chemistry
What problem does this paper attempt to address?
This paper introduces MolPipeline, a Python package that brings RDKit's molecule processing into the scikit-learn Pipeline framework. Its main goal is to make it possible to build fully automated end-to-end chemoinformatics workflows that scale to large data sets. The paper puts particular emphasis on flexibility and on handling erroneous instances, which usually require manual intervention in conventional pipelines.

MolPipeline includes the following key features:
1. Automation: Full automation from molecular data set to deployable machine learning model.
2. Scalability: Parallelization and low memory usage, achieved through instance-based processing.
3. Flexible building blocks: Standard pipeline modules for constructing custom pipelines for different chemoinformatics tasks.
4. Error handling: Tracking, logging, and substitution of failed instances, such as SMILES strings that cannot be parsed.
5. Serialization: Support for pipeline reuse and version control.

The paper also gives examples, including automated input format determination, molecular standardization (e.g., salt removal), molecule-based clustering (e.g., by Murcko scaffolds), and molecular encoding tasks. In addition, MolPipeline provides error filtering and reinsertion: invalid instances in large data sets are removed before the steps that cannot process them, and placeholder values are put back at their original positions in the output, so no manual intervention is needed.

By combining the chemistry capabilities of RDKit with the machine learning algorithms of scikit-learn, MolPipeline aims to simplify the construction of molecular property prediction models, and it is particularly suited to real-world low-data scenarios. The paper also describes how to use MolPipeline for hyperparameter optimization and model selection, and provides code examples.
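
The error filtering and reinsertion idea can be illustrated with a minimal sketch in plain Python. It does not use the MolPipeline, RDKit, or scikit-learn APIs. The names `toy_parse`, `toy_model`, and `predict_with_reinsertion` are hypothetical stand-ins introduced only for illustration, and the validity check (balanced parentheses) is an assumption that is far simpler than real SMILES parsing.

```python
import math
from typing import Callable, List, Optional, Sequence


def toy_parse(smiles: str) -> Optional[str]:
    """Hypothetical stand-in for a SMILES parser; returns None on failure."""
    # Assumption: empty strings and unbalanced parentheses count as invalid.
    if not smiles or smiles.count("(") != smiles.count(")"):
        return None
    return smiles


def toy_model(mol: str) -> float:
    """Hypothetical stand-in for a trained model; 'predicts' the string length."""
    return float(len(mol))


def predict_with_reinsertion(
    smiles_list: Sequence[str],
    parse: Callable[[str], Optional[str]],
    model: Callable[[str], float],
    fill_value: float = math.nan,
) -> List[float]:
    """Predict for valid inputs and put fill_value where parsing failed."""
    parsed = [parse(s) for s in smiles_list]
    # Filter step: remember the positions of instances that parsed successfully.
    valid_positions = [i for i, mol in enumerate(parsed) if mol is not None]
    # The model only ever sees valid instances.
    valid_predictions = [model(parsed[i]) for i in valid_positions]
    # Reinsertion step: the output has the same length and order as the input.
    output = [fill_value] * len(smiles_list)
    for position, prediction in zip(valid_positions, valid_predictions):
        output[position] = prediction
    return output


predictions = predict_with_reinsertion(
    ["CCO", "C1CC1(", "", "c1ccccc1"], toy_parse, toy_model
)
print(predictions)  # [3.0, nan, nan, 8.0]
```

Because the output keeps the length and order of the input, predictions can be joined back to the original data set directly, and the failed instances remain identifiable by their fill value.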