MGCN-PolyA: An Integrated Computational Framework for Predicting Poly(A) Signals with Multiscale-gated Convolutional Networks

Jujuan Zhuang,Wanquan Gao,Xinru Huang,Guoyan Chen
DOI: https://doi.org/10.2174/0115748936289520240828050951
2024-09-19
Current Bioinformatics
Abstract:Background: The accurate recognition of the polyadenylation signal (PAS) from DNA sequences is essential for understanding gene transcriptional regulation. A variety of machine learning-based computational methods have been developed to predict PAS in recent years; however, their performance and their generalization ability are unsatisfactory. It is highly desirable to design more preferable computational approaches for PAS prediction. Methods: In this work, we developed an integrated framework MGCN-PolyA for PAS prediction across four species, including Homo sapiens, Bos taurus, Mus musculus, and Drosophila melanogaster. MGCN-Poly(A) benefits from the diversity of feature engineering and the effectiveness of the model architecture. We combined features from different perspectives, such as word embedding, One-hot encoding, K-mer frequency, and Enhanced Nucleic Acid Composition (ENAC), which complement each other and provide rich and comprehensive information for model learning. In model architecture, MGCN-Poly(A) leverages a two-channel multi-scale gated convolutional network to effectively learn high-level feature representations at different scales, and then combines the statistical features to predict PAS using random forest algorithm. These designs not only speed up network training, but also improves the generalization ability Results: The benchmarking experiments on the independent test datasets demonstrate that MGCNPolyA outperforms other state-of-the-art algorithms in identifying PAS. MGCN-PolyA has the highest accuracy on all test datasets, and its excellent performance on cross-species validation also demonstrates the robustness of our model. Conclusion: Extracting features from different perspectives is important for PAS recognition, and the integration of DNNs and shallow machine learning algorithms can improve the model performance.
biochemical research methods,mathematical & computational biology
What problem does this paper attempt to address?