Abstract: Understanding the pathogenicity of missense mutations (MMs) is essential for shedding light on genetic diseases, gene functions, and individual variations. In this study, we propose a novel computational approach, called MMPatho, for enhancing missense mutation pathogenic prediction. First, we established a large-scale nonredundant MM benchmark data set based on the entire Ensembl database, complemented by a focused blind test set specifically for pathogenic gain-of-function/loss-of-function (GOF/LOF) MMs. For each mutation in this data set, we used Ensembl VEP v104 and dbNSFP v4.1a to extract variant-level, amino acid-level, individuals' outputs, and genome-level features. Additionally, protein sequences were retrieved by ENSP identifier through the Ensembl API and then encoded, and the ESM-1b and ProtTrans-T5 embeddings of the mutant sites were extracted. Building on these efforts, we developed our model group, MMPatho, which comprises ConsMM and EvoIndMM. Specifically, ConsMM employs individuals' outputs and XGBoost with SHAP explanation analysis, while EvoIndMM investigates whether predictive capability can be enhanced by incorporating evolutionary information from the embeddings of two large protein language models, ESM-1b and ProtT5-XL-U50. In comparative experiments, ConsMM and EvoIndMM achieved AUROC values of 0.9836 and 0.9854 and AUPR values of 0.9852 and 0.9902, respectively, on the blind test set, which shares no variations or proteins with the training data, highlighting the strength of our computational approach in predicting MM pathogenicity. Our Web server, available at http://csbio.njust.edu.cn/bioinf/mmpatho/, allows researchers to predict the pathogenicity of MMs (alongside a reliability index score) using the ConsMM and EvoIndMM models and provides extensive annotations for user input. Additionally, the newly constructed benchmark data set and blind test set can be accessed via the data page of our Web server.
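The two metrics reported above, AUROC and AUPR, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names `auroc` and `average_precision`, the toy labels (1 = pathogenic, 0 = benign), and the toy scores are all assumptions made for illustration, and AUPR is approximated here by average precision, which is one common estimator of the area under the precision-recall curve.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels))  # ascending by score
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        # Group tied scores and give each member the average 1-based rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u_stat = rank_sum - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision at the rank of each positive.

    Ties in the scores are broken by input order (a simplification).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(labels)), key=lambda k: -scores[k])
    true_pos = 0
    total = 0.0
    for rank, k in enumerate(order, start=1):
        if labels[k] == 1:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos


# Hypothetical toy data, not taken from the MMPatho benchmark.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(f"AUROC = {auroc(toy_labels, toy_scores):.4f}")              # 0.8889
print(f"AUPR  = {average_precision(toy_labels, toy_scores):.4f}")  # 0.9167
```

On the toy data, eight of the nine pathogenic/benign pairs are ranked correctly, which gives the AUROC of 8/9. Both metrics reach 1.0 only when every pathogenic variant scores above every benign one.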

The MBLOSUM: A Server for Deriving Mutation Targets and Position-specific Substitution Rates

MutationExplorer: a webserver for mutation of proteins and 3D visualization of energetic impacts

MyBASE: a Database for Genome Polymorphism and Gene Function Studies of Mycobacterium

MarkUs: a Server to Navigate Sequence-Structure-function Space.

Mgenomesubtractor: A Web-Based Tool For Parallel In Silico Subtractive Hybridization Analysis Of Multiple Bacterial Genomes

SDM: a server for predicting effects of mutations on protein stability

NeMu: a comprehensive pipeline for accurate reconstruction of neutral mutation spectra from evolutionary data

MutaRNA: analysis and visualization of mutation-induced changes in RNA structure

SMAL: A Resource of Spontaneous Mutation Accumulation Lines.

FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets

PTMsnp: A Web Server for the Identification of Driver Mutations That Affect Protein Post-translational Modification

Nabe: An Energetic Database Of Amino Acid Mutations In Protein-Nucleic Acid Binding Interfaces

MMPatho: Leveraging Multilevel Consensus and Evolutionary Information for Enhanced Missense Mutation Pathogenic Prediction

Quantification of the effect of mutations using a global probability model of natural sequence variation

MAESTROweb: a web server for structure-based protein stability prediction

Development of the protein virtual mutagenesis software for the site-directed and saturation mutagenesis

LOMETS2: improved meta-threading server for fold-recognition and structure-based function annotation for distant-homology proteins.

HotSpot3D Web Server: an Integrated Resource for Mutation Analysis in Protein 3D Structures

Accurate prediction of site- and amino-acid substitution rates with a mutation-selection model

PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels

ProteoMutaMetrics: machine learning approaches for solute carrier family 6 mutation pathogenicity prediction