AdvanceSplice: Integrating N-gram one-hot encoding and ensemble modeling for enhanced accuracy

Mohammad Reza Rezvan, Ali Ghanbari Sorkhi, Jamshid Pirgazi, Mohammad Mehdi Pourhashem Kallehbasti
DOI: https://doi.org/10.1016/j.bspc.2024.106017
IF: 5.1
2024-02-19
Biomedical Signal Processing and Control
Abstract: Accurate splice site prediction is a critical challenge in genomics, essential for understanding gene expression and disease-associated mutations. Splice sites mark the boundaries between exons and introns in genetic sequences and are crucial for proper RNA splicing and protein synthesis. Existing prediction methods are limited by complex feature extraction and by constrained accuracy. This study introduces AdvanceSplice, a method that integrates two feature extraction approaches, N-gram one-hot encoding and character-to-numerical encoding, and combines models by majority voting in an ensemble. The design of AdvanceSplice relies on diversity in feature extraction to improve the accuracy of splice site prediction. AdvanceSplice begins with N-gram processing, which captures local patterns within DNA sequences. These N-grams are then transformed into binary images using one-hot encoding, giving a representation suited to subsequent analysis. In parallel, character-to-numerical encoding supplies a second, sequence-based view of the same data. Of the five deep learning models in AdvanceSplice, four process the image-like binary representations derived from N-gram encoding, while the fifth processes sequence information through character-to-numerical encoding. This diversified approach allows the ensemble to explore patterns and dependencies associated with various N-gram representations and sequence-based features. The ensemble strategy of AdvanceSplice combines the predictions of all five models by majority vote to improve the overall accuracy of splice site identification. Comparisons with existing models on the HS3D, Homo sapiens, and A. thaliana datasets indicate that AdvanceSplice identifies splice sites more effectively, contributing to genomics and bioinformatics by improving splice site prediction.
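The encoding and voting steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the overlapping sliding window, the A/C/G/T alphabet, and the integer mapping are all assumptions made for illustration, and the deep learning models themselves are replaced by a plain list of predicted labels.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def ngram_one_hot(seq, n):
    """Encode a sequence as a binary matrix (a list of rows).

    One row per overlapping n-gram (assumed sliding window, stride 1),
    one column per possible n-gram over ALPHABET (4**n columns).
    """
    vocab = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=n))}
    rows = []
    for start in range(len(seq) - n + 1):
        row = [0] * len(vocab)
        row[vocab[seq[start:start + n]]] = 1
        rows.append(row)
    return rows


def char_to_numeric(seq):
    """Map each base to an integer (hypothetical mapping A=0, C=1, G=2, T=3)."""
    return [ALPHABET.index(ch) for ch in seq]


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


# Usage: encode a short toy sequence and combine five hypothetical model votes.
image = ngram_one_hot("ACGTAC", 2)        # 5 rows x 16 columns
numeric = char_to_numeric("ACGTAC")
label = majority_vote([1, 0, 1, 1, 0])    # 1 = splice site, 0 = not (assumed labels)
```

With five models and two classes a tie cannot occur, which is one practical reason to use an odd number of voters.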
engineering, biomedical