From chemical similarity measures to an unconventional modeling framework: The application of c-RASAR along with dimensionality reduction techniques in a representative hepatotoxicity dataset

Kunal Roy, Arkaprava Banerjee
DOI: https://doi.org/10.26434/chemrxiv-2024-b4rln
2024-07-22
Abstract: With the exponential progress in the field of cheminformatics, the conventional modeling approaches have so far been to employ supervised and unsupervised machine learning (ML) and deep learning models, utilizing the standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a particular compound. Deviating from the conventional approach, in this investigation, we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR), which involves the amalgamation of the concepts of classification-based quantitative structure-activity relationship (QSAR) and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We have employed different classification modeling algorithms on the selected QSAR and RASAR descriptors to develop predictive models targeted towards the efficient prediction of hepatotoxicity of query compounds. The predictivity of each of these models was evaluated on a large number of test set compounds. Additionally, the best-performing model was used to screen a true external set of data. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of various unsupervised dimensionality reduction techniques like t-SNE and UMAP, and the supervised ARKA framework showed the usefulness of the RASAR descriptors over the selected QSAR descriptors in their ability to group similar compounds, enhancing the modelability of the dataset and efficiently identifying activity cliffs.
Furthermore, the activity cliffs were also identified from Read-Across by observing the nature of compounds constituting the nearest neighbors for a particular query compound. On comparing our simple linear c-RASAR model with the previously reported models developed using the same dataset derived from the US FDA Orange Book (https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm), it was observed that our model is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set surpasses that of the previously reported work. Therefore, the present simple LDA c-RASAR model can efficiently be used to predict the hepatotoxicity of query chemicals.
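The abstract states that RASAR descriptors carry similarity-based information from the close source neighbors of a query compound. The following is a minimal sketch of that idea, not the authors' implementation: it assumes fingerprints are represented as sets of "on" bit indices, uses Tanimoto similarity, and computes three illustrative descriptors. The function names, the descriptor names (`max_sim`, `mean_sim`, `weighted_pos_fraction`), the choice of `k`, and the toy data are all hypothetical; the actual descriptor definitions are not given in this section.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_descriptors(
    query: Set[int],
    sources: List[Tuple[Set[int], int]],
    k: int = 3,
) -> Dict[str, float]:
    """Illustrative similarity descriptors for one query compound.

    sources: (fingerprint, label) pairs, label 1 = hepatotoxic, 0 = non-toxic.
    k: number of closest source compounds to use (an assumed setting).
    """
    if not sources:
        raise ValueError("at least one source compound is required")

    # Rank source compounds by similarity to the query and keep the k closest.
    scored = sorted(
        ((tanimoto(query, fp), label) for fp, label in sources),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    sims = [s for s, _ in scored]
    total = sum(sims)
    # Similarity-weighted share of positive (toxic) compounds among the neighbors.
    weighted_pos = sum(s for s, y in scored if y == 1) / total if total > 0 else 0.0

    return {
        "max_sim": sims[0],
        "mean_sim": total / len(sims),
        "weighted_pos_fraction": weighted_pos,
    }


# Toy data (hypothetical): four source compounds and one query.
sources = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({7, 8, 9}, 0),
    ({1, 2, 6}, 0),
]
desc = similarity_descriptors({1, 2, 3}, sources, k=3)
print(desc)
```

Descriptors of this kind, computed for every training and test compound against the training set, could then be passed to any classifier in place of, or alongside, conventional molecular descriptors.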
Chemistry
What problem does this paper attempt to address?
This paper addresses the challenges that traditional modeling methods face in handling small datasets and predicting external datasets when estimating the hepatotoxicity of drugs and drug-like molecules. Specifically, the authors introduce a modeling framework, classification Read-Across Structure-Activity Relationship (c-RASAR), to overcome the limitations of existing methods, such as the traditional Quantitative Structure-Activity Relationship (QSAR) model, in predicting drug-induced liver injury (DILI). These limitations include the scarcity of experimental data, model complexity, and insufficient prediction performance on external validation sets. By combining the similarity-based Read-Across method with machine-learning techniques, the c-RASAR model uses information from similar compounds to improve the predictive ability and interpretability of the model. The study also explores unsupervised dimensionality-reduction techniques (such as t-SNE and UMAP) and the supervised ARKA framework to improve the identification of activity cliffs in the dataset, that is, pairs of compounds that are very similar in structure but differ significantly in activity. In summary, the main goal of the paper is to develop a simple, reproducible, transferable, and highly predictive model for efficiently predicting the hepatotoxicity of drugs and drug-like molecules. Such a model could help reduce the cost and time of experimental evaluation and improve the understanding of drug safety, thereby supporting the drug-development process.
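The definition of an activity cliff given above (structurally very similar compounds with different activity) can be illustrated with a short sketch. This is an assumption-laden toy, not the procedure used in the paper: fingerprints are sets of bit indices, similarity is Tanimoto, activity is a binary class label, and the 0.7 similarity threshold, the function name `find_activity_cliffs`, and the compound data are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_activity_cliffs(
    compounds: List[Tuple[str, Set[int], int]],
    threshold: float = 0.7,
) -> List[Tuple[str, str, float]]:
    """Return pairs that are similar (>= threshold) but carry different class labels.

    compounds: (name, fingerprint, label) triples, label 1 = toxic, 0 = non-toxic.
    threshold: assumed similarity cut-off for calling two compounds 'similar'.
    """
    cliffs = []
    for (name_a, fp_a, y_a), (name_b, fp_b, y_b) in combinations(compounds, 2):
        sim = tanimoto(fp_a, fp_b)
        if sim >= threshold and y_a != y_b:
            cliffs.append((name_a, name_b, sim))
    return cliffs


# Toy data (hypothetical): B is structurally close to A and D but has the opposite label.
compounds = [
    ("A", {1, 2, 3, 4}, 1),
    ("B", {1, 2, 3, 4, 5}, 0),
    ("C", {10, 11, 12}, 0),
    ("D", {1, 2, 3, 5}, 1),
]
cliffs = find_activity_cliffs(compounds, threshold=0.7)
for name_a, name_b, sim in cliffs:
    print(f"{name_a} / {name_b}: similarity {sim:.2f}, opposite labels")
```

A pairwise scan like this is quadratic in the number of compounds, which is acceptable for small datasets; the nearest-neighbor view described in the abstract reaches a similar result by inspecting the labels of each query compound's closest neighbors.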