Abstract: With rapid progress in cheminformatics, conventional modeling approaches have relied on supervised and unsupervised machine learning (ML) and deep learning models built on standard molecular descriptors, which represent the structural, physicochemical, and electronic properties of a compound. Deviating from this conventional approach, we have employed the classification Read-Across Structure-Activity Relationship (c-RASAR) framework, which combines the concepts of classification-based quantitative structure-activity relationship (QSAR) modeling and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We applied different classification algorithms to the selected QSAR and RASAR descriptors to develop models for predicting the hepatotoxicity of query compounds. The predictivity of each model was evaluated on a large number of test set compounds. Additionally, the best-performing model was used to screen a true external set of data. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of unsupervised dimensionality reduction techniques such as t-SNE and UMAP, and of the supervised ARKA framework, showed that the RASAR descriptors outperform the selected QSAR descriptors in grouping similar compounds, enhancing the modelability of the dataset, and efficiently identifying activity cliffs.
Furthermore, activity cliffs were also identified from Read-Across by observing the nature of the compounds constituting the nearest neighbors of a particular query compound. On comparing our simple linear c-RASAR model with previously reported models developed using the same dataset derived from the US FDA Orange Book (https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm), we observed that our model is simple, reproducible, transferable, and highly predictive. The performance of the LDA c-RASAR model on the true external set surpasses that of the previously reported work. Therefore, the present simple LDA c-RASAR model can be used to efficiently predict the hepatotoxicity of query chemicals.
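
The idea of deriving descriptors from the close source neighbors of a query compound can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: fingerprints are toy sets of "on" bit indices, Tanimoto similarity stands in for whichever similarity measure the authors used, and the descriptor names (mean_sim, max_pos_sim, max_neg_sim, weighted_pos_fraction) are hypothetical and are not the RASAR descriptor set from the paper.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rasar_descriptors(
    query: Set[int],
    sources: List[Tuple[Set[int], int]],
    k: int = 3,
) -> Dict[str, float]:
    """Compute illustrative similarity-based descriptors for one query compound.

    `sources` holds (fingerprint, label) pairs, where label 1 means toxic
    and 0 means non-toxic. Only the k most similar source compounds are used.
    """
    # Rank source compounds by similarity to the query and keep the top k.
    ranked = sorted(
        ((tanimoto(query, fp), label) for fp, label in sources),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    sims = [s for s, _ in ranked]
    pos = [s for s, y in ranked if y == 1]  # similarities to toxic neighbors
    neg = [s for s, y in ranked if y == 0]  # similarities to non-toxic neighbors
    total = sum(sims)

    return {
        "mean_sim": total / len(ranked),
        "max_pos_sim": max(pos, default=0.0),
        "max_neg_sim": max(neg, default=0.0),
        # Similarity-weighted share of toxic compounds among the neighbors.
        "weighted_pos_fraction": sum(pos) / total if total else 0.0,
    }


# Toy source set (hypothetical data): two toxic and two non-toxic compounds.
sources = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({1, 6, 7}, 0),
    ({8, 9, 10}, 0),
]
query = {1, 2, 3, 6}

descriptors = rasar_descriptors(query, sources, k=3)
```

In a workflow of this kind, one such descriptor row would be computed for every training and test compound, and the resulting table would then be passed to a classifier such as linear discriminant analysis (LDA).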

What problem does this paper attempt to address?

This paper addresses the difficulties that traditional modeling methods face with small datasets and with external-set prediction when predicting the hepatotoxicity of drugs and drug-like molecules. Specifically, the authors introduce a new modeling framework, classification Read-Across Structure-Activity Relationship (c-RASAR), to overcome the limitations of existing methods (such as traditional Quantitative Structure-Activity Relationship (QSAR) models) in predicting drug-induced liver injury (DILI). These limitations include the scarcity of experimental data, model complexity, and insufficient prediction performance on external validation sets. By combining the similarity-based Read-Across method with machine-learning techniques, the c-RASAR model uses information from similar compounds to improve the predictive ability and interpretability of the model. In addition, the study explores unsupervised dimensionality-reduction techniques (such as t-SNE and UMAP) and the supervised ARKA framework to improve the identification of activity cliffs in the dataset, that is, pairs of compounds that are very similar in structure but differ significantly in activity. In summary, the main goal of the paper is to develop a simple, reproducible, transferable, and highly predictive model for efficiently predicting the hepatotoxicity of drugs and drug-like molecules. Such a model could help reduce the cost and time of experimental evaluation and improve the understanding of drug safety, thereby supporting the drug-development process.
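
The definition of an activity cliff given above (structurally very similar compounds with different activity) can be turned into a simple pairwise check. The sketch below is an assumption-laden illustration, not the procedure used in the paper: fingerprints are toy sets of bit indices, Tanimoto similarity is an assumed choice of similarity measure, the 0.6 threshold is arbitrary, and the compound names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Compound = Tuple[str, FrozenSet[int], int]  # (name, fingerprint bits, label)


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_activity_cliffs(
    compounds: List[Compound],
    threshold: float = 0.6,  # arbitrary illustrative cutoff, not from the paper
) -> List[Tuple[str, str, float]]:
    """Return pairs that are similar (>= threshold) but carry different labels."""
    cliffs = []
    for (name_a, fp_a, y_a), (name_b, fp_b, y_b) in combinations(compounds, 2):
        sim = tanimoto(fp_a, fp_b)
        if sim >= threshold and y_a != y_b:
            cliffs.append((name_a, name_b, sim))
    return cliffs


# Toy data (hypothetical): label 1 = hepatotoxic, 0 = non-hepatotoxic.
compounds = [
    ("cmpd_A", frozenset({1, 2, 3, 4, 5}), 1),
    ("cmpd_B", frozenset({1, 2, 3, 4, 6}), 0),
    ("cmpd_C", frozenset({1, 2, 3, 4, 5, 7}), 1),
    ("cmpd_D", frozenset({8, 9}), 0),
]

cliffs = find_activity_cliffs(compounds)
for name_a, name_b, sim in cliffs:
    print(f"{name_a} / {name_b}: similarity {sim:.2f}, opposite labels")
```

A pairwise scan like this is quadratic in the number of compounds, which is acceptable for small datasets; for larger ones, restricting the comparison to each compound's nearest neighbors would be a natural alternative.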

From chemical similarity measures to an unconventional modeling framework: The application of c-RASAR along with dimensionality reduction techniques in a representative hepatotoxicity dataset

The application of chemical similarity measures in an unconventional modeling framework c-RASAR along with dimensionality reduction techniques to a representative hepatotoxicity dataset

Machine learning-assisted c-RASAR modeling of a curated set of orally active nephrotoxic drugs: Similarity-based predictions from close source neighbors

Molecular Similarity in Predictive Toxicology with a Focus on the q-RASAR Technique

Structure‐activity Relationship Approaches and Applications

Quantitative Read-Across Structure-Activity Relationship (q-RASAR): A novel approach to estimate the subchronic oral safety (NOAEL) of diverse organic chemicals in rats

Molecular similarity in chemical informatics and predictive toxicity modeling: from quantitative read-across (q-RA) to quantitative read-across structure–activity relationship (q-RASAR) with the application of machine learning

Quantitative Structure–activity Relationship: Promising Advances in Drug Discovery Platforms

ARKA: A framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data

Machine learning-based q-RASAR predictions of the bioconcentration factor of organic molecules estimated following the organisation for economic co-operation and development guideline 305

How Precise Are Our Quantitative Structure-Activity Relationship Derived Predictions for New Query Chemicals?

Prediction of the Aquatic Toxicity of Aromatic Compounds to Tetrahymena Pyriformis Through Support Vector Regression

Initial Development of Automated Machine Learning-Assisted Prediction Tools for Aryl Hydrocarbon Receptor Activators

A natural language processing approach based on embedding deep learning from heterogeneous compounds for quantitative structure–activity relationship modeling

High Throughput Read-Across for Screening a Large Inventory of Related Structures by Balancing Artificial Intelligence/Machine Learning and Human Knowledge

Explainable AI and tree-based ensemble models: a comparative study in predicting chemical pulmonary toxicity

Interpretable deep-learning pKa prediction for small molecule drugs via atomic sensitivity analysis

Predicting Chemical Immunotoxicity through Data-Driven QSAR Modeling of Aryl Hydrocarbon Receptor Agonism and Related Toxicity Mechanisms

Exploring QSAR models for activity-cliff prediction

Validating ADME QSAR Models Using Marketed Drugs

QSAR modelling of a large imbalanced aryl hydrocarbon activation dataset by rational and random sampling and screening of 80,086 REACH pre-registered and/or registered substances