Transcriptome-wide single molecule mapping of 2´-O-Methylation (Nm) sites in Nanopore direct RNA sequencing datasets using the Nm-nano framework

Aditya Ariyur, S. Janga, Janga, Doaa Hassan, Quoseena Mir, S. V. Daulatabad, Sarath Chandra
Abstract: Nm (2´-O-methylation) is one of the most abundant modifications of mRNAs and non-coding RNAs; it occurs when a methyl group (–CH3) is added to the 2´ hydroxyl (–OH) of the ribose moiety. Because every ribose sugar carries this hydroxyl group, the modification can occur on any nucleotide regardless of its nitrogenous base. Nm contributes to many biological processes, such as the normal functioning of tRNA, the protection of mRNA against degradation by DXO, and the biogenesis and specificity of rRNA. Recently, the single-molecule long-read RNA sequencing offered by Oxford Nanopore Technologies has enabled direct detection of RNA modifications on the molecule being sequenced, but to our knowledge only one previous study has applied this technology to predict the stoichiometry of Nm-modified sites, and it did so in RNA sequences of yeast cells. In this paper, we extend this research direction by proposing a bio-computational framework, Nm-Nano, for predicting Nm sites in Nanopore direct RNA sequencing reads of human cell lines, which are larger and more complex than yeast. The Nm-Nano framework integrates two supervised machine learning (ML) models for predicting Nm sites in Nanopore sequencing data: Extreme Gradient Boosting (XGBoost) and Random Forest (RF) with k-mer embedding. The XGBoost model is trained with features extracted from the modified and unmodified Nanopore signals and their corresponding k-mers, which come from the underlying RNA sequence reported by base-calling. The RF model is trained with the same set of features plus a dense vector representation of the RNA k-mers generated by the word2vec technique. Results on two benchmark datasets generated from Nanopore RNA sequencing data of the Hela and Hek293 human cell lines show that Nm-Nano performs well.
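As a rough illustration of how signal-derived features and k-mers might be combined into a single training row, here is a minimal Python sketch. It is not the Nm-Nano implementation: the choice of summary statistics (mean, standard deviation, sample count), the function names, and the toy data are all assumptions, and a plain one-hot encoding stands in for the word2vec embedding so that the example needs only the standard library.

```python
from statistics import mean, pstdev

BASES = "ACGU"  # RNA alphabet assumed for the base-called sequence


def kmers(sequence, k=5):
    """Return every overlapping k-mer of a base-called RNA sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def one_hot(kmer):
    """Encode a k-mer as a flat 0/1 vector, four positions per base.

    Stand-in for a learned word2vec embedding (an assumption of this sketch).
    """
    return [1.0 if base == b else 0.0 for base in kmer for b in BASES]


def feature_row(kmer, signal):
    """Concatenate signal summary statistics with the k-mer encoding."""
    stats = [mean(signal), pstdev(signal), float(len(signal))]
    return stats + one_hot(kmer)


# Hypothetical toy data: one list of raw current samples per 5-mer.
sequence = "GGACUAC"
signals = [
    [1.0, 2.0, 3.0],
    [4.0, 4.0, 4.0, 4.0],
    [0.5, 1.5],
]

rows = [feature_row(k, s) for k, s in zip(kmers(sequence), signals)]
```

Rows like these, paired with modified/unmodified labels, are the kind of input a classifier such as XGBoost or RF would be fit on; with word2vec, the one-hot block would be replaced by the learned dense vector for each k-mer.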
In independent validation testing, in which each model was trained on the Hela benchmark dataset and tested on the Hek293 benchmark dataset, Nm-Nano identified Nm sites with an accuracy of 93% using XGBoost and 88% using RF with k-mer embedding. Deploying Nm-Nano to predict Nm sites in the Hela cell line identified 196 genes as the most frequently Nm-modified among all genes carrying Nm sites in this cell line. Functional and gene set enrichment analysis of these genes shows significant enrichment of a wide range of functional processes in the Hela cell line; the high-confidence (adjusted p-value < 0.05) enriched ontologies were more representative of the role of Nm modification in immune response and cellular homeostasis. Similarly, deploying Nm-Nano to predict Nm sites in the Hek293 cell line identified 176 genes as the most frequently Nm-modified genes in this cell line. Functional and gene set enrichment analysis of these genes shows significant enrichment of a wide range of functional processes in the Hek293 cell line, such as “MHC class 1 protein complex”, “mitotic spindle assembly”, “response to glucocorticoid”, and “nucleocytoplasmic transport”. The source code of Nm-Nano can be
Computer Science, Biology, Materials Science