A Novel Computational Machine Learning Pipeline to Quantify Similarities in Three-Dimensional Protein Structures

Shreyas U Hirway, Xiao Xu, Fan Fan
DOI: https://doi.org/10.1101/2024.08.14.607969
2024-08-17
Abstract: Animal models are widely used during drug development. The selection of a suitable animal model relies on various factors such as target biology, animal resource availability and legacy species. It is imperative that the selected animal species exhibit the highest resemblance to humans, in terms of both target biology and similarity of the target protein. The current practice for assessing cross-species protein similarity relies on pairwise comparison of protein sequences, instead of the biologically relevant 3-dimensional (3D) structure of proteins. We developed a novel quantitative machine learning pipeline using 3D structure-based feature data from the Protein Data Bank, nominal data from UNIPROT and bioactivity data from ChEMBL, all of which were matched for human and animal data. Using the XGBoost regression model, similarity scores between targets were calculated and, based on these scores, the best animal species for a target was identified. For real-world application, targets from an alternative source, i.e., AlphaFold, were tested using the model, and the animal species whose proteins were most similar to their human counterparts were predicted. These targets were then grouped based on their associated phenotype such that the pipeline could predict an optimal animal species.
Pharmacology and Toxicology
What problem does this paper attempt to address?
The paper addresses the problem of selecting the most suitable animal model to represent human target proteins in the drug development process. Traditional cross-species protein similarity assessment relies mainly on pairwise comparison of protein sequences rather than on the biologically more relevant three-dimensional structures. The authors therefore developed a novel quantitative machine learning pipeline based on three-dimensional structural feature data, UNIPROT nominal data, and ChEMBL bioactivity data. The pipeline uses the XGBoost regression model to calculate similarity scores between targets and, based on these scores, identifies the animal species most suitable for a specific target. Specifically, the study aims to:

1. **Utilize three-dimensional protein structures for cross-species comparison**: By integrating data from the Protein Data Bank (PDB), UNIPROT, and ChEMBL, construct a comprehensive, end-to-end computational pipeline to quantify protein similarity across different species.
2. **Improve the accuracy of animal model selection**: By calculating similarity scores of protein structural and functional attributes, predict the animal species most closely related to human target proteins, thereby optimizing the selection of animal models in drug development.
3. **Validate the model's effectiveness**: By comparing the model's prediction results with actual case study data, validate the model's accuracy and reliability in real-world applications.

Overall, the goal of the paper is to develop a machine learning method based on three-dimensional protein structures to select suitable animal models for drug development more accurately.
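The final selection step, going from similarity scores to a recommended species, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the target names, phenotype groups, species, and score values are all hypothetical placeholders standing in for the output of a trained regression model, and averaging scores within a phenotype group is an assumed aggregation rule that the summary does not specify.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical similarity scores (human target vs. animal counterpart).
# In the described pipeline these would come from a regression model;
# the values below are illustrative only.
scores = [
    # (target, phenotype_group, species, similarity_score)
    ("TARGET_A", "group_1", "mouse", 0.50),
    ("TARGET_A", "group_1", "rat", 0.75),
    ("TARGET_A", "group_1", "dog", 0.25),
    ("TARGET_B", "group_1", "mouse", 0.50),
    ("TARGET_B", "group_1", "rat", 0.75),
    ("TARGET_C", "group_2", "mouse", 1.00),
    ("TARGET_C", "group_2", "dog", 0.50),
]


def best_species_per_target(rows):
    """Return {target: (species, score)} for the highest-scoring species."""
    best = {}
    for target, _group, species, score in rows:
        if target not in best or score > best[target][1]:
            best[target] = (species, score)
    return best


def best_species_per_group(rows):
    """Average scores per (group, species), then keep the top species per group.

    Averaging is an assumed aggregation rule chosen for illustration.
    """
    pooled = defaultdict(list)
    for _target, group, species, score in rows:
        pooled[(group, species)].append(score)

    best = {}
    for (group, species), values in pooled.items():
        avg = mean(values)
        if group not in best or avg > best[group][1]:
            best[group] = (species, avg)
    return best


per_target = best_species_per_target(scores)
per_group = best_species_per_group(scores)
```

Here `per_target` maps each target to its single best-scoring species, while `per_group` pools targets that share a phenotype so that one species can be recommended for the whole group. Other aggregation rules (for example, taking the minimum score across targets to guard against a single poorly matched protein) would fit the same structure.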