DeepMol: An Automated Machine and Deep Learning Framework for Computational Chemistry

Joao Correia, Joao Capela, Miguel Rocha
DOI: https://doi.org/10.1101/2024.05.27.595849
2024-06-01
Abstract: The domain of computational chemistry has experienced a significant evolution due to the introduction of Machine Learning (ML) technologies. Despite the potential of ML to revolutionize the field, researchers are often encumbered by obstacles, such as the complexity of selecting optimal algorithms, the automation of data pre-processing steps, the necessity for adaptive feature engineering, and the assurance of model performance consistency across different datasets. Addressing these issues head-on, DeepMol stands out as an Automated ML (AutoML) tool by automating critical steps of the ML pipeline. DeepMol rapidly and automatically identifies the most effective data representation, pre-processing methods and model configurations for a specific molecular property/activity prediction problem. On 22 benchmark datasets, DeepMol obtained competitive pipelines compared with those requiring time-consuming feature engineering, model design and selection processes. As one of the first AutoML tools specifically developed for the computational chemistry domain, DeepMol stands out with its open-source code, in-depth tutorials, detailed documentation, and examples of real-world applications, all available at https://github.com/BioSystemsUM/DeepMol and https://deepmol.readthedocs.io/en/latest/. By introducing AutoML as a groundbreaking feature in computational chemistry, DeepMol establishes itself as the pioneering state-of-the-art tool in the field.
Bioinformatics
What problem does this paper attempt to address?
This paper introduces DeepMol, an automated machine learning (AutoML) tool designed specifically for computational chemistry. The study points out that although machine learning has clear potential in this field, researchers often struggle with selecting the best algorithms, automating data preprocessing, adapting feature engineering to each problem, and ensuring consistent model performance across datasets. DeepMol automates key steps of the machine learning pipeline, rapidly and automatically identifying the data representation, preprocessing methods, and model configuration for a given molecular property/activity prediction problem. On 22 benchmark datasets, the pipelines it found were competitive with those that require time-consuming manual feature engineering, model design, and model selection. As one of the first AutoML tools in computational chemistry, DeepMol stands out for its open-source code, in-depth tutorials, detailed documentation, and real-world application examples.
The paper also discusses the development of quantitative structure-activity/property relationship (QSAR/QSPR) models in computational chemistry, noting that growing data availability and model complexity increase the demand for advanced techniques, including deep learning. DeepMol addresses the challenges of automated preprocessing, model selection, and performance consistency in QSAR/QSPR modeling by providing a comprehensive and customizable AutoML framework.
Key features of DeepMol include a modular design that lets users customize every step, from data loading and processing to model prediction and interpretability analysis. It can also automatically optimize preprocessing methods, feature engineering techniques, machine learning and deep learning models, and their hyperparameters. Furthermore, it supports a range of machine learning tasks: binary classification, multi-class classification, multi-label/multi-task classification, and regression. DeepMol builds on existing toolkits, namely RDKit, Scikit-Learn, TensorFlow, DeepChem, and Optuna, to enhance its functionality and efficiency.
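To make the idea of searching jointly over representations, models, and hyperparameters concrete, the following is a minimal sketch in plain Python. It does not use DeepMol's API, and it substitutes a simple random search for the Optuna-based optimization mentioned above. The toy SMILES strings, their labels, the two featurizers (`atom_counts`, `length_only`), and the k-nearest-neighbour classifier are all illustrative assumptions, not taken from the paper.

```python
import random
from collections import Counter

# Toy SMILES strings with made-up binary labels (1 = contains nitrogen).
# Illustrative data only; not from the paper's benchmarks.
DATA = [("CCO", 0), ("CCN", 1), ("c1ccccc1", 0), ("c1ccncc1", 1),
        ("CC(=O)O", 0), ("CC(=O)N", 1), ("CCCC", 0), ("NCCN", 1),
        ("CCOC", 0), ("CNC", 1), ("C=C", 0), ("C#N", 1)]


def atom_counts(smiles):
    """Hypothetical featurizer: counts of C, N and O characters."""
    s = smiles.upper()
    return [s.count("C"), s.count("N"), s.count("O")]


def length_only(smiles):
    """Hypothetical (deliberately weak) featurizer: string length."""
    return [len(smiles)]


FEATURIZERS = {"atom_counts": atom_counts, "length_only": length_only}


def knn_predict(train, x, k):
    """Majority vote among the k nearest training rows (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def evaluate(config, data):
    """Leave-one-out accuracy of one pipeline configuration."""
    featurize = FEATURIZERS[config["featurizer"]]
    rows = [(featurize(smiles), label) for smiles, label in data]
    hits = 0
    for i, (x, label) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        hits += knn_predict(train, x, config["k"]) == label
    return hits / len(rows)


def search(data, n_trials=20, seed=0):
    """Random search over (featurizer, k); returns the best config and score."""
    rng = random.Random(seed)
    best_config, best_score = None, -1.0
    for _ in range(n_trials):
        config = {
            "featurizer": rng.choice(sorted(FEATURIZERS)),
            "k": rng.choice([1, 3, 5]),  # odd k avoids ties with two classes
        }
        score = evaluate(config, data)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    config, score = search(DATA)
    print(config, score)
```

The point of the sketch is the shape of the loop: each trial samples a complete pipeline configuration, scores it with a validation scheme, and keeps the best one. A real AutoML system would swap in chemically meaningful representations, a far larger model space, and a smarter sampler than uniform random choice.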