Prediction of Drug-Induced Liver Injury: From Molecular Physicochemical Properties and Scaffold Architectures to Machine Learning Approaches

Yulong Zhao, Zhoudong Zhang, Xiaotian Kong, Kai Wang, Yaxuan Wang, Jie Jia, Huanqiu Li, Sheng Tian
DOI: https://doi.org/10.1111/cbdd.14607
Abstract: Developing new drugs is widely acknowledged to be time-intensive and to require substantial financial investment. Despite ongoing efforts to reduce the time and expense of drug development, ensuring medication safety remains an urgent problem. One of the major safety problems in drug development is hepatotoxicity, specifically drug-induced liver injury (DILI). DILI often poses a significant barrier during development and frequently leads to the recall of drugs after launch. In silico methods have many advantages over traditional in vivo and in vitro assays. Establishing a more precise and reliable prediction model requires an extensive, high-quality database of drug molecule properties and structural patterns, together with carefully selected molecular descriptors that accurately depict compound characteristics. The aim of this study was to conduct a comprehensive investigation into the prediction of DILI. First, we compared the physicochemical properties of an extensive, well-prepared set of DILI-positive and DILI-negative compounds. Then, we used classic substructure dissection methods to identify differences in structural patterns between these two classes of molecules. The results indicate that it is not feasible to establish property-based or substructure-based rules for distinguishing DILI-positive from DILI-negative compounds. Finally, we developed quantitative classification models for predicting DILI using the naïve Bayes classifier (NBC) and recursive partitioning (RP) machine learning techniques. The optimal DILI prediction model was obtained with NBC, combining 21 physicochemical properties, the VolSurf descriptors, and the LCFP_10 fingerprint set. This model achieved a global accuracy (GA) of 0.855 and an area under the curve (AUC) of 0.704 for the training set, and 0.619 and 0.674, respectively, for the test set. Moreover, substructural fragments favorable or unfavorable for DILI were identified from the best naïve Bayesian classification model. These findings may help prioritize lead compounds in the early stages of drug development pipelines.
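To make the modeling step concrete, the following is a minimal sketch of a naïve Bayes classifier over binary fingerprint bits, scored with GA and AUC. It is an illustration under stated assumptions, not the study's implementation: the four-bit vectors are invented toy data standing in for fingerprint bits such as LCFP_10, the physicochemical and VolSurf descriptors are omitted, the Bernoulli variant with Laplace smoothing is one common choice that may differ from the NBC used in the paper, and all function names are hypothetical.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on binary feature vectors (hypothetical helper)."""
    n_features = len(X[0])
    model = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        # Laplace-smoothed probability that each bit is set within this class.
        p_on = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                for j in range(n_features)]
        model[label] = (math.log(len(rows) / len(X)), p_on)
    return model

def log_odds(model, x):
    """Log-odds of the positive class for one bit vector; > 0 favors DILI-positive."""
    def joint(label):
        log_prior, p_on = model[label]
        return log_prior + sum(
            math.log(p if bit else 1.0 - p) for bit, p in zip(x, p_on))
    return joint(1) - joint(0)

def global_accuracy(y_true, scores, threshold=0.0):
    """Fraction of compounds whose thresholded prediction matches the label."""
    predicted = [1 if s > threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(predicted, y_true)) / len(y_true)

def roc_auc(y_true, scores):
    """ROC AUC as the chance that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (assumption): 1 = DILI-positive, 0 = DILI-negative.
X_train = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0],
           [0, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]
y_train = [1, 1, 1, 0, 0, 0]
X_test = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 1], [0, 1, 1, 0]]
y_test = [1, 1, 0, 0]

model = train_bernoulli_nb(X_train, y_train)
test_scores = [log_odds(model, x) for x in X_test]
ga = global_accuracy(y_test, test_scores)
auc = roc_auc(y_test, test_scores)
print(f"test GA = {ga:.3f}, test AUC = {auc:.3f}")
```

GA depends on the decision threshold, whereas AUC depends only on how the scores rank positives against negatives, so the two values can differ on the same test set. In this kind of model, the per-bit log-probability ratios are also what would flag individual fragments as favorable or unfavorable for DILI.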