MolPipeline: A Python package for processing molecules with RDKit in scikit-learn

Christian Wolfgang Feldmann, Jochen Sieg, Miriam Mathea, Jennifer Hemmerich, Conrad Stork, Frederik Sandfort, Philipp Eiden

DOI: https://doi.org/10.26434/chemrxiv-2024-kd11b

2024-04-19

Abstract: The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to the machine learning model. We introduce the MolPipeline package, which extends this concept to chemoinformatics by wrapping default functionalities of RDKit, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package to create completely automated end-to-end pipelines that scale to large data sets. Particular emphasis was put on handling erroneous instances, where resolution would require manual intervention in default pipelines. In addition, we included common cheminformatics tasks, like scaffold splits and molecular standardization, natively in the pipeline framework and adaptable for the needs of various projects.

Chemistry

What problem does this paper attempt to address?

This paper introduces MolPipeline, a Python package that combines RDKit's molecule processing with the scikit-learn pipeline framework. Its main goal is to make it possible to build automated end-to-end chemoinformatics workflows, especially for large datasets. The paper emphasizes flexibility and the handling of erroneous instances, which usually require manual intervention in conventional pipelines.

MolPipeline includes the following key features:

1. Automation: Full automation from molecular dataset to deployable machine learning model.
2. Scalability: Parallelization and low memory usage, achieved through instance-based processing.
3. Flexible building blocks: Standard pipeline modules for constructing custom pipelines for different chemoinformatics tasks.
4. Error handling: Tracking, logging, and substitution of failed instances, such as SMILES strings that cannot be parsed.
5. Serialization: Support for pipeline reuse and version control.

The paper also gives examples, such as automated input format determination, molecular standardization (e.g., salt removal), molecule-based clustering (e.g., by Murcko scaffolds), and molecular encoding tasks. In addition, MolPipeline provides error filtering and reinsertion, so that invalid instances in large datasets are handled automatically: failed instances are removed before the steps that cannot process them, and placeholders are put back at their original positions afterwards.

By integrating the cheminformatics capabilities of RDKit with the machine learning algorithms of scikit-learn, MolPipeline aims to simplify the construction of molecular property prediction models, and it is described as particularly suitable for real-world low-data scenarios. The paper also describes how to use MolPipeline for hyperparameter optimization and model selection, with specific code examples.
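The error filtering and reinsertion idea can be illustrated with a minimal, library-free sketch. This is not MolPipeline's API: `toy_parse`, `toy_model`, and `predict_with_reinsertion` are hypothetical names introduced only for illustration, and the toy parser is a crude stand-in for RDKit's SMILES parsing (it assumes that empty strings and unbalanced parentheses are the only failure cases).

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def toy_parse(smiles: str) -> Optional[str]:
    """Hypothetical stand-in for a SMILES parser; returns None on failure."""
    # Assumption: empty input or unbalanced parentheses count as invalid.
    if not smiles or smiles.count("(") != smiles.count(")"):
        return None
    return smiles


def toy_model(molecules: Sequence[str]) -> List[float]:
    """Hypothetical model: 'predicts' the length of each parsed string."""
    return [float(len(mol)) for mol in molecules]


def predict_with_reinsertion(
    inputs: Sequence[str],
    parse: Callable[[str], Optional[str]],
    predict: Callable[[Sequence[str]], List[float]],
    fill_value: float = math.nan,
) -> Tuple[List[float], List[int]]:
    """Filter failed instances, predict on the rest, reinsert placeholders."""
    parsed = [parse(item) for item in inputs]
    # Track which positions parsed successfully and which failed.
    valid_idx = [i for i, mol in enumerate(parsed) if mol is not None]
    failed_idx = [i for i, mol in enumerate(parsed) if mol is None]
    # Only valid instances reach the model.
    predictions = predict([parsed[i] for i in valid_idx]) if valid_idx else []
    # Start from placeholders so the output aligns with the input order.
    output = [fill_value] * len(inputs)
    for i, pred in zip(valid_idx, predictions):
        output[i] = pred
    return output, failed_idx


results, failed = predict_with_reinsertion(
    ["CCO", "C(C", "c1ccccc1", ""], toy_parse, toy_model
)
print(results, failed)
```

The output list has the same length and order as the input, so predictions can be joined back to the original records, while the returned indices make the failures available for logging.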
