Benchmarking ML in ADMET predictions: the practical impact of feature representations in ligand-based models

Gintautas Kamuntavičius, Orestis Bastas, Tanya Paquet, Roy Tal, Povilas Norvaišas, Dainius Šalkauskas, Alvaro Prat, Hisham Abdel Aty
DOI: https://doi.org/10.26434/chemrxiv-2023-x6tjr-v5
2024-04-17
Abstract: This study, focusing on predicting Absorption, Distribution, Metabolism, Excretion, and Toxicology (ADMET) properties, addresses the key challenges of ML models trained using ligand-based representations. We propose a structured approach to data feature selection, taking a step beyond the conventional practice of combining different representations without systematic reasoning. Additionally, we enhance model evaluation methods by integrating cross-validation with statistical hypothesis testing, adding a layer of reliability to the model assessments. Our final evaluations include a practical scenario, where models trained on one source of data are evaluated on a different one. This approach aims to bolster the reliability of ADMET predictions, providing more dependable and informative model evaluations.
Chemistry
What problem does this paper attempt to address?
This paper addresses key challenges in ligand-based machine learning (ML) models that predict absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties. The research focuses on the following aspects:

1. **Feature Representation Selection**: The paper proposes a systematic method for selecting data features, going beyond the common practice of combining different representations without systematic reasoning. The authors test experimentally how different feature combinations affect model performance in order to identify the best-performing combination.
2. **Improvement of Model Evaluation Methods**: The authors combine cross-validation with statistical hypothesis testing, which makes model evaluation more reliable and gives a sounder basis for comparing and optimizing models.
3. **Generalization Ability across Data Sources**: The study includes a practical scenario in which a model trained on one data source is evaluated on a dataset from a different source. This tests how well ADMET predictions hold up across datasets.
4. **Impact of External Data**: The study examines how model performance changes when external data is combined with internal data for training, which provides guidance for future research.
5. **Data Cleaning**: To address common data quality problems in public ADMET datasets (such as inconsistent SMILES representations and duplicate measurements), the authors apply detailed data-cleaning steps to improve the accuracy and consistency of the data.
6. **Selection of Algorithms and Feature Representations**: The study compares machine learning algorithms (such as CatBoost, SVM, and LightGBM) and compound representations (such as RDKit descriptors and Morgan fingerprints) and explores how these choices affect ADMET prediction.

### Summary

Through systematic feature selection, improved model evaluation methods, and generalization tests across data sources, this paper aims to improve the reliability and accuracy of ligand-based ML models in ADMET prediction. It also examines how adding external data affects model performance, providing a reference for future drug discovery projects.
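
The idea in point 2, pairing cross-validation with a statistical hypothesis test, can be illustrated with a minimal sketch. This summary does not say which test the authors use, so the example below swaps in an exact paired sign-flip permutation test as one possible choice. The function name `paired_permutation_test` and the per-fold scores are hypothetical and serve only as illustration; they are not taken from the paper.

```python
import itertools
import statistics


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired fold scores.

    Returns the mean per-fold difference (a - b) and the p-value under the
    null hypothesis that both models perform equally well on every fold.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need exactly one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.fmean(diffs))

    # Under the null, each fold difference is equally likely to carry
    # either sign, so enumerate all 2**n sign assignments.
    at_least_as_extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(statistics.fmean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance guards against floating-point round-off.
        if permuted >= observed - 1e-12:
            at_least_as_extreme += 1
    return statistics.fmean(diffs), at_least_as_extreme / total


# Hypothetical per-fold scores for two feature representations evaluated
# on the same 5 cross-validation folds (made-up numbers).
scores_a = [0.81, 0.79, 0.84, 0.80, 0.83]
scores_b = [0.78, 0.77, 0.80, 0.79, 0.80]

mean_diff, p_value = paired_permutation_test(scores_a, scores_b)
print(f"mean difference = {mean_diff:.3f}, p = {p_value:.4f}")
```

In this made-up example model A wins on every fold, yet the two-sided p-value is 2/32 = 0.0625, the smallest value attainable with only five folds. This is one reason a comparison of mean scores alone can mislead, and why repeated cross-validation or more folds may be needed before claiming that one representation outperforms another.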