Nm-Nano: a machine learning framework for transcriptome-wide single-molecule mapping of 2′-O-methylation (Nm) sites in nanopore direct RNA sequencing datasets

Doaa Hassan, Aditya Ariyur, Swapna Vidhur Daulatabad, Quoseena Mir, Sarath Chandra Janga
a Department of Biohealth Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University Indianapolis (IUI), Indianapolis, Indiana, USA
b Computers and Systems Department, National Telecommunication Institute, Cairo, Egypt
c Centre for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana
DOI: https://doi.org/10.1080/15476286.2024.2352192
2024-05-19
RNA Biology
Abstract: 2′-O-methylation (Nm) is one of the most abundant modifications found in both mRNAs and noncoding RNAs. It contributes to many biological processes, such as the normal functioning of tRNA, the protection of mRNA against degradation by the decapping and exoribonuclease (DXO) protein, and the biogenesis and specificity of rRNA. Recent advancements in single-molecule sequencing techniques for long-read RNA sequencing data offered by Oxford Nanopore Technologies have enabled the direct detection of RNA modifications from sequencing data. In this study, we propose a bio-computational framework, Nm-Nano, for predicting the presence of Nm sites in direct RNA sequencing data generated from two human cell lines. The Nm-Nano framework integrates two supervised machine learning (ML) models for predicting Nm sites: Extreme Gradient Boosting (XGBoost) and Random Forest (RF) with K-mer embedding. Evaluation on benchmark datasets from direct RNA sequencing of HeLa and HEK293 cell lines demonstrates high accuracy (99% with XGBoost and 92% with RF) in identifying Nm sites. Deploying Nm-Nano on HeLa and HEK293 cell lines reveals genes that are frequently modified with Nm. In HeLa cell lines, 125 genes are identified as frequently Nm-modified, showing enrichment in 30 ontologies related to immune response and cellular processes. In HEK293 cell lines, 61 genes are identified as frequently Nm-modified, with enrichment in processes like glycolysis and protein localization. These findings underscore the diverse regulatory roles of Nm modifications in metabolic pathways, protein degradation, and cellular processes. The source code of Nm-Nano can be freely accessed at https://github.com/Janga-Lab/Nm-Nano.
biochemistry & molecular biology
What problem does this paper attempt to address?
The main problem this paper addresses is the development of a new bio-computational framework, Nm-Nano, for transcriptome-wide, single-molecule prediction of 2′-O-methylation (Nm) sites in nanopore direct RNA sequencing data. Specifically, the paper aims to:

1. **Improve detection accuracy**: Combine machine learning models, namely Extreme Gradient Boosting (XGBoost) and Random Forest (RF) with K-mer embedding, to improve the accuracy of Nm site detection.
2. **Overcome the limitations of existing methods**: Experimental methods such as RiboMethSeq and RibOxi-seq can detect Nm sites, but they require a large amount of input RNA and are costly and time-consuming. Existing computational methods are mainly based on short-read data and cannot fully exploit the advantages of long-read data. Nm-Nano uses the long-read data generated by nanopore sequencing to provide a more efficient and accurate method for predicting Nm sites.
3. **Expand the scope of application**: Existing nanopore-based tools such as nanoRMS have only been tested in yeast cells, whereas Nm-Nano is applied to human cell lines (HeLa and HEK293) to identify Nm-modified genes and analyze their functional enrichment.
4. **Reveal the function of Nm modification**: By identifying genes that are frequently Nm-modified in HeLa and HEK293 cell lines, the paper studies the roles of these genes in biological processes such as immune response, metabolic pathways, and protein localization, and thereby clarifies the biological significance of Nm modification.

### Main contributions of the paper

- **High-performance prediction model**: On benchmark datasets from the HeLa and HEK293 cell lines, Nm-Nano achieves accuracies of 99% with XGBoost and 92% with RF.
- **Feature selection and optimization**: An analysis of how different features affect model performance found that the position feature contributes the most to classifier accuracy, followed by the model mean and K-mer match features.
- **Biological analysis**: Deploying Nm-Nano identified 125 genes that are frequently Nm-modified in the HeLa cell line, enriched in immune response and cellular processes, and 61 genes that are frequently Nm-modified in the HEK293 cell line, enriched in glycolysis and protein localization processes.

### Conclusion

Nm-Nano detects Nm sites with high accuracy and makes effective use of the long-read data generated by nanopore sequencing, providing a new tool for the study of RNA modifications. Applied to human cell lines, it reveals roles for Nm modification in multiple biological processes and provides a reference for further biological research.
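
To make the feature set named in the contributions above (position, model mean, K-mer match) more concrete, here is a minimal Python sketch of how one aligned signal event might be turned into a numeric feature row for a classifier. It is an illustration under stated assumptions, not the Nm-Nano implementation: the `Event` class, its field names, and both helper functions are hypothetical, and a simple one-hot encoding stands in for the K-mer embedding described in the paper.

```python
from dataclasses import dataclass
from typing import List

BASES = "ACGT"


@dataclass
class Event:
    """One aligned signal event. Field names are hypothetical."""
    position: int        # reference position of the event
    reference_kmer: str  # k-mer taken from the reference at that position
    model_kmer: str      # k-mer assigned to the event by the pore model
    model_mean: float    # expected current level for model_kmer


def one_hot_kmer(kmer: str) -> List[int]:
    """Encode a k-mer as a flat one-hot vector with four slots per base.

    This is a simple stand-in for an embedding. Bases outside ACGT
    (for example N) produce four zeros; U is treated as T.
    """
    vec: List[int] = []
    for base in kmer.upper().replace("U", "T"):
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def feature_vector(event: Event) -> List[float]:
    """Build one feature row: position, model mean, k-mer match flag,
    followed by the one-hot encoding of the reference k-mer."""
    kmer_match = 1.0 if event.reference_kmer == event.model_kmer else 0.0
    encoded = [float(x) for x in one_hot_kmer(event.reference_kmer)]
    return [float(event.position), event.model_mean, kmer_match] + encoded


if __name__ == "__main__":
    # Made-up values, for illustration only.
    ev = Event(position=120, reference_kmer="GATCA",
               model_kmer="GATCA", model_mean=85.2)
    print(feature_vector(ev))
```

Rows built this way could be stacked into a matrix and passed to any supervised classifier, such as a gradient-boosted tree model or a random forest. The numeric values in the usage example are invented and do not come from the paper's datasets.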