Denoising Drug Discovery Data for Improved ADMET Property Prediction

Alan Cheng, Yunsie Chung, Matthew Adrian
DOI: https://doi.org/10.26434/chemrxiv-2024-v4jvc
2024-04-22
Abstract: Predicting ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of small molecules is a key task in drug discovery. A major challenge in building better ADMET models is the experimental error inherent in the data. Furthermore, ADMET predictors are typically regression tasks due to the continuous nature of the data. This makes it difficult to apply existing methods as most focus on classification tasks. Here, we develop denoising schemes based on deep learning to address this. We find that the training error can be used to identify the noise in regression tasks while ensemble-based and forgotten event-based metrics fail to detect the noise. The most significant performance increase occurs when the original model is finetuned with the denoised data using training error as the noise detection metric. Our method has the ability to improve models with medium noise and does not degrade the performance of models with noise outside this range. To our knowledge, our denoising scheme is the first to improve model performance for ADMET data and has implications for improving models for experimental assay data in general.
Chemistry
What problem does this paper attempt to address?
This paper addresses experimental noise in drug discovery data, with the goal of improving prediction of the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of small molecules. ADMET prediction is a key task in drug discovery, but the experimental data carry inherent measurement error, which limits how good a model can become. Because the measured values are continuous, ADMET prediction is usually framed as regression, whereas most existing denoising methods target classification and do not transfer directly. The paper proposes deep learning-based denoising schemes that use training error to identify noisy data points in regression tasks; ensemble-based and forgotten event-based metrics, by contrast, fail to detect the noise. The largest performance gain comes from finetuning the original model on the denoised data, with training error as the noise detection metric. The method improves models trained on data with medium noise and does not degrade models whose noise falls outside that range. The paper also discusses how data imbalance, dataset size, and experimental error in the test set affect the denoising scheme, and investigates whether noise in multi-task models propagates between tasks and affects performance. To the authors' knowledge, this is the first denoising scheme shown to improve model performance for ADMET data.
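The general idea of using training error as a noise detection metric and then finetuning can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses deep learning models on ADMET assay data, while the example below substitutes a small linear regressor on synthetic data so that it runs with NumPy alone. The helper names (`fit_linear`, `predict`, `flag_noisy`), the `drop_fraction` cutoff, and the synthetic noise levels are all assumptions made for illustration.

```python
import numpy as np

def fit_linear(X, y, w=None, lr=0.05, epochs=200):
    """Least-squares fit by gradient descent.

    Passing existing weights `w` continues training from them, which
    stands in for finetuning the original model (an assumption here).
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    w = np.zeros(Xb.shape[1]) if w is None else w.copy()
    for _ in range(epochs):
        grad = 2.0 * Xb.T @ (Xb @ w - y) / len(y)
        w -= lr * grad
    return w

def predict(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def flag_noisy(X, y, w, drop_fraction=0.1):
    """Flag the samples with the largest absolute training error.

    `drop_fraction` is a hypothetical cutoff, not a value from the paper.
    """
    err = np.abs(predict(X, w) - y)
    threshold = np.quantile(err, 1.0 - drop_fraction)
    return err > threshold

# Synthetic regression data: small measurement noise everywhere,
# plus large injected errors on 50 of the 500 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.5, 0.3])  # last entry is the bias
y = predict(X, true_w) + rng.normal(scale=0.1, size=500)
noisy_idx = rng.choice(500, size=50, replace=False)
y[noisy_idx] += rng.normal(scale=3.0, size=50)

# 1) Train the original model on all (noisy) data.
w0 = fit_linear(X, y)
# 2) Use training error to flag suspected noisy labels.
mask = flag_noisy(X, y, w0, drop_fraction=0.1)
# 3) Finetune the original model on the retained (denoised) data.
w1 = fit_linear(X[~mask], y[~mask], w=w0, epochs=100)

overlap = np.isin(np.flatnonzero(mask), noisy_idx).sum()
print("flagged:", int(mask.sum()), "of which injected:", int(overlap))
print("weight error before:", np.linalg.norm(w0 - true_w))
print("weight error after: ", np.linalg.norm(w1 - true_w))
```

In this toy setting the flagged set can be compared against the known injected indices, which is not possible with real assay data where the noisy labels are unknown. How aggressively to drop points is a design choice: a fixed fraction is the simplest option, but a threshold tied to the assay's known experimental error could be used instead.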