A Novel Computational Machine Learning Pipeline to Quantify Similarities in Three-Dimensional Protein Structures

Shreyas U Hirway, Xiao Xu, Fan Fan

DOI: https://doi.org/10.1101/2024.08.14.607969

2024-08-17

Abstract: Animal models are widely used during drug development. The selection of a suitable animal model relies on various factors such as target biology, animal resource availability, and legacy species. It is imperative that the selected animal species exhibit the highest resemblance to humans, in terms of both target biology and similarity of the target protein. The current practice for assessing cross-species protein similarity relies on pairwise comparison of protein sequences rather than the biologically relevant three-dimensional (3D) structure of proteins. We developed a novel quantitative machine learning pipeline using 3D structure-based feature data from the Protein Data Bank, nominal data from UNIPROT, and bioactivity data from ChEMBL, all of which were matched for human and animal data. Using the XGBoost regression model, similarity scores between targets were calculated, and based on these scores the best animal species for each target was identified. For real-world application, targets from an alternative source, i.e., AlphaFold, were tested using the model, and the animal species whose protein was most similar to its human counterpart was predicted. These targets were then grouped by their associated phenotype so that the pipeline could predict an optimal animal species.
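The final selection step described in the abstract (rank species by similarity score, then group targets by phenotype) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the target names, species, phenotype labels, and score values are hypothetical, a higher score is assumed to mean greater similarity to the human protein, and aggregating by the mean score within a phenotype is an assumption that the abstract does not specify.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical similarity scores between the human target and each animal
# ortholog. In the paper these would come from the trained regression model;
# the numbers here are made up for illustration only.
scores = {
    ("TARGET_A", "mouse"): 0.82,
    ("TARGET_A", "rat"): 0.79,
    ("TARGET_A", "dog"): 0.91,
    ("TARGET_B", "mouse"): 0.88,
    ("TARGET_B", "rat"): 0.93,
    ("TARGET_B", "dog"): 0.70,
    ("TARGET_C", "mouse"): 0.60,
    ("TARGET_C", "rat"): 0.95,
    ("TARGET_C", "dog"): 0.65,
}

# Hypothetical mapping from target to associated phenotype.
phenotype_of = {
    "TARGET_A": "phenotype_1",
    "TARGET_B": "phenotype_2",
    "TARGET_C": "phenotype_2",
}


def best_species_per_target(scores):
    """Return {target: (species, score)} for the highest-scoring species."""
    best = {}
    for (target, species), score in scores.items():
        if target not in best or score > best[target][1]:
            best[target] = (species, score)
    return best


def best_species_per_phenotype(scores, phenotype_of):
    """Average scores over the targets in each phenotype (an assumed
    aggregation rule) and return {phenotype: (species, mean_score)}."""
    grouped = defaultdict(list)
    for (target, species), score in scores.items():
        grouped[(phenotype_of[target], species)].append(score)

    best = {}
    for (phenotype, species), values in grouped.items():
        avg = mean(values)
        if phenotype not in best or avg > best[phenotype][1]:
            best[phenotype] = (species, avg)
    return best


if __name__ == "__main__":
    print(best_species_per_target(scores))
    print(best_species_per_phenotype(scores, phenotype_of))
```

Keeping the per-target and per-phenotype rankings as separate functions makes it easy to swap the mean for another aggregation rule (for example, the minimum score across targets) if a more conservative choice is wanted.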

Pharmacology and Toxicology

What problem does this paper attempt to address?

The paper attempts to address the problem of selecting the most suitable animal model to simulate human target proteins in the drug development process. Traditional cross-species protein similarity assessment methods mainly rely on pairwise comparisons of protein sequences rather than the biologically more relevant three-dimensional structures. Therefore, the authors developed a novel quantitative machine learning pipeline based on three-dimensional structural feature data, UNIPROT nominal data, and ChEMBL bioactivity data, using the XGBoost regression model to calculate similarity scores between targets and identify the animal species most suitable for specific targets based on these scores.

Specifically, the study aims to:

1. **Utilize three-dimensional protein structures for cross-species comparison**: By integrating data from the Protein Data Bank (PDB), UNIPROT, and ChEMBL, construct a comprehensive, end-to-end computational pipeline to quantify protein similarity across different species.
2. **Improve the accuracy of animal model selection**: By calculating similarity scores of protein structural and functional attributes, predict the animal species most closely related to human target proteins, thereby optimizing the selection of animal models in drug development.
3. **Validate the model's effectiveness**: By comparing the model's prediction results with actual case study data, validate the model's accuracy and reliability in real-world applications.

Overall, the goal of the paper is to develop a machine learning method based on three-dimensional protein structures to more accurately select suitable animal models for drug development.
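The data-integration step in the first aim (records from PDB, UNIPROT, and ChEMBL matched for human and animal data) might look roughly like the sketch below. It only illustrates the matching idea under stated assumptions: the `(target, species)` keys, the feature names, and all values are hypothetical, and the real pipeline's identifiers and matching rules are not given in this summary.

```python
def merge_sources(*sources):
    """Merge per-source records, keeping only (target, species) keys that
    are present in every source."""
    shared = set(sources[0]).intersection(*sources[1:])
    merged = {}
    for key in shared:
        record = {}
        for source in sources:
            record.update(source[key])
        merged[key] = record
    return merged


def matched_pairs(merged, reference="human"):
    """Return (target, species, human_record, animal_record) tuples for
    targets that have both a human record and at least one animal record."""
    by_target = {}
    for (target, species), record in merged.items():
        by_target.setdefault(target, {})[species] = record

    pairs = []
    for target, per_species in sorted(by_target.items()):
        if reference not in per_species:
            continue  # no human record, so nothing to compare against
        for species in sorted(per_species):
            if species != reference:
                pairs.append(
                    (target, species, per_species[reference], per_species[species])
                )
    return pairs


# Hypothetical stand-ins for the three data sources (illustrative values).
structure_features = {  # stands in for PDB-derived 3D features
    ("T1", "human"): {"n_helices": 12},
    ("T1", "mouse"): {"n_helices": 11},
    ("T2", "human"): {"n_helices": 7},
    ("T2", "mouse"): {"n_helices": 7},
}
annotations = {  # stands in for UNIPROT nominal data
    ("T1", "human"): {"family": "kinase"},
    ("T1", "mouse"): {"family": "kinase"},
    ("T2", "human"): {"family": "protease"},
    ("T2", "mouse"): {"family": "protease"},
}
bioactivity = {  # stands in for ChEMBL bioactivity data
    ("T1", "human"): {"n_assays": 40},
    ("T1", "mouse"): {"n_assays": 9},
    ("T2", "human"): {"n_assays": 15},
    # ("T2", "mouse") is missing, so T2 yields no matched pair
}

if __name__ == "__main__":
    merged = merge_sources(structure_features, annotations, bioactivity)
    for target, species, human_rec, animal_rec in matched_pairs(merged):
        print(target, species, human_rec, animal_rec)
```

Each matched pair would then be turned into a feature vector for the regression model; that step is omitted here because the summary does not describe the feature encoding.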
