Performance evaluation of computational methods for splice-disrupting variants and improving the performance using the machine learning-based framework

Hao Liu, Jiaqi Dai, Ke Li, Yang Sun, Haoran Wei, Hong Wang, Chunxia Zhao, Dao Wen Wang
DOI: https://doi.org/10.1093/bib/bbac334
IF: 9.5
2022-08-19
Briefings in Bioinformatics
Abstract: A critical challenge in genetic diagnostics is the assessment of genetic variants associated with disease, specifically variants that fall outside canonical splice sites and act by altering alternative splicing. Several computational methods have been developed to prioritize variants by their effect on splicing; however, performance evaluation of these methods is hampered by the lack of large-scale benchmark datasets. In this study, we employed a splicing-region-specific strategy to evaluate the performance of prediction methods on eight independent datasets. Under most conditions, we found that dbscSNV-ADA performed better in the exonic region, S-CAP performed better in the core donor and acceptor regions, S-CAP and SpliceAI performed better in the extended acceptor region, and MMSplice performed better in identifying variants that caused exon skipping. However, the performance of the prediction methods varied widely across datasets and splicing regions, and no single method performed best overall on all datasets. To address this, we developed a new method, machine learning-based classification of splice site variants (MLCsplice), which predicts a variant's effect on splicing from the scores of the individual methods. We demonstrated that MLCsplice achieved stable and superior prediction performance compared with any individual method. To facilitate the identification of the splicing effect of variants, we provide precomputed MLCsplice scores for all possible splice site variants across human protein-coding genes (http://39.105.51.3:8090/MLCsplice/). We believe that the performance of the individual methods on the eight benchmark datasets will provide tentative guidance for selecting an appropriate method to prioritize candidate splice-disrupting variants, thereby increasing the genetic diagnostic yield.
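The splicing-region-specific evaluation described in the abstract can be illustrated with a minimal sketch: variants are grouped by splicing region, and each method is scored separately within each region. The abstract does not state which metric or code the authors used, so the choice of ROC AUC, the function names, the record layout, the placeholder method names (`method_a`, `method_b`), and the toy scores below are all assumptions made for illustration and are not data from the paper.

```python
from collections import defaultdict


def auc(labels, scores):
    """ROC AUC via pairwise comparison (Mann-Whitney U); ties count as 0.5.

    Returns None when a group lacks either positives or negatives.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_region(records, methods):
    """Compute one AUC per (region, method) instead of one global AUC."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["region"]].append(rec)
    return {
        region: {
            m: auc([r["label"] for r in rows], [r[m] for r in rows])
            for m in methods
        }
        for region, rows in grouped.items()
    }


# Synthetic toy records (NOT from the paper): label 1 = splice-disrupting.
records = [
    {"region": "exonic", "label": 1, "method_a": 0.9, "method_b": 0.4},
    {"region": "exonic", "label": 0, "method_a": 0.2, "method_b": 0.6},
    {"region": "exonic", "label": 1, "method_a": 0.7, "method_b": 0.8},
    {"region": "core_donor", "label": 1, "method_a": 0.5, "method_b": 0.9},
    {"region": "core_donor", "label": 0, "method_a": 0.6, "method_b": 0.1},
]

results = auc_by_region(records, ["method_a", "method_b"])
for region, per_method in results.items():
    print(region, per_method)
```

Even on this toy input, the ranking of the two placeholder methods flips between regions, which is the kind of effect a single pooled metric would hide.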
biochemical research methods, mathematical & computational biology