Abstract

Pseudouridine (Ψ) is one of the most abundant RNA modifications, formed when pseudouridine synthase proteins catalyze the isomerization of uridine. It plays an important role in many biological processes and is also relevant to drug development. Single-molecule sequencing techniques such as the direct RNA sequencing platform offered by Oxford Nanopore Technologies now enable direct detection of RNA modifications on the molecule being sequenced, but to our knowledge this technology has not yet been used to identify RNA pseudouridine sites. To address this limitation, we introduce Penguin, a tool that integrates several machine learning (ML) models (i.e., predictors) to identify RNA pseudouridine sites in Nanopore direct RNA sequencing reads. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore sequencer and from the corresponding basecalled k-mer. These features are used to train the predictors included in Penguin, which in turn predict whether a signal is modified by the presence of a pseudouridine site. The predictors included in Penguin are a support vector machine (SVM), a random forest (RF), and a neural network (NN). Results on two benchmark data sets show that Penguin identifies pseudouridine sites with a high accuracy of 93.38% in random split testing and 92.61% in independent validation testing, both using the SVM. Penguin thus outperforms the existing pseudouridine predictors in the literature, which achieved an accuracy of at most 76.0% in independent validation testing. The tool is available on GitHub at https://github.com/Janga-Lab/Penguin.

HIGHLIGHTS

Penguin integrates several ML models (i.e., predictors) to identify RNA Ψ sites in Nanopore direct RNA sequencing reads.

The Penguin pipeline automates data preprocessing (alignment of Nanopore direct RNA reads with Minimap2 and Nanopore signal extraction with Nanopolish), feature extraction from the raw Nanopore signal for training the integrated ML predictors, and prediction of RNA Ψ sites with those predictors.

Penguin predicts Ψ sites with a performance that exceeds that of the state-of-the-art methods in the literature.

The Penguin platform can be adapted to predict other types of RNA modification.

In the complete HEK293 cell line, Penguin's best ML model (SVM) predicted 6137606 of 67491289 U-mer samples as Ψ, corresponding to 556813 unique genomic locations of Ψ.

In the complete HeLa cell line, Penguin's best ML model (SVM) predicted 1193192 of 229637931 U-mer samples as Ψ, corresponding to 39384 unique genomic locations of Ψ.

A small fraction of 0.01% (6482 unique genomic locations) of Ψ is common to (overlaps between) the HEK293 and HeLa cell lines.

The extent of Ψ modification (the number of U-mer samples predicted as Ψ divided by the total number of U-mer samples in the complete RNA sequence of the cell line) is much greater in the HEK293 cell line than in the HeLa cell line (9% for HEK293 versus 0.5% for HeLa).
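The extent of Ψ modification quoted in the highlights can be recomputed from the reported counts. The following is a minimal sketch and not part of the Penguin code base: the class name CellLineCounts and its field names are hypothetical and chosen only for illustration, while the four counts are copied from the highlights above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellLineCounts:
    """Hypothetical container for the U-mer counts reported in the highlights."""
    name: str
    predicted_psi: int  # U-mer samples the SVM predicted as pseudouridine
    total_umers: int    # all U-mer samples in the complete cell line

    def extent(self) -> float:
        """Extent of modification: predicted-as-pseudouridine / total U-mer samples."""
        return self.predicted_psi / self.total_umers


# Counts taken verbatim from the highlights.
hek293 = CellLineCounts("HEK293", 6_137_606, 67_491_289)
hela = CellLineCounts("HeLa", 1_193_192, 229_637_931)

for cell_line in (hek293, hela):
    pct = 100 * cell_line.extent()
    print(f"{cell_line.name}: {pct:.2f}% of U-mer samples predicted as pseudouridine")
```

With these counts the ratios come out at roughly 9% for HEK293 and roughly 0.5% for HeLa, consistent with the rounded figures given in the highlights.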

PseUdeep: RNA Pseudouridine Site Identification with Deep Learning Algorithm

PseUpred-ELPSO Is an Ensemble Learning Predictor with Particle Swarm Optimizer for Improving the Prediction of RNA Pseudouridine Sites

PseU-Pred: An ensemble model for accurate identification of pseudouridine sites

Penguin: A Tool for Predicting Pseudouridine Sites in Direct RNA Nanopore Sequencing Data

A robust deep learning approach for identification of RNA 5-methyluridine sites

Deepm6A-MT: A deep learning-based method for identifying RNA N6-methyladenosine sites in multiple tissues

S2Snet: deep learning for low molecular weight RNA identification with nanopore

m5U-GEPred: prediction of RNA 5-methyluridine sites based on sequence-derived and graph embedding features

Fuzzy kernel evidence Random Forest for identifying pseudouridine sites

EMDLP: Ensemble multiscale deep learning model for RNA methylation site prediction

DeepOMe: A Web Server for the Prediction of 2'-O-Me Sites Based on the Hybrid CNN and BLSTM Architecture

m5UMCB: Prediction of RNA 5-methyluridine sites using multi-scale convolutional neural network with BiLSTM

Comprehensive Review and Assessment of Computational Methods for Predicting RNA Post-Transcriptional Modification Sites from RNA Sequences

Deep-2'-O-Me: Predicting 2'-O-methylation sites by Convolutional Neural Networks

i5hmCVec: Identifying 5-Hydroxymethylcytosine Sites of Drosophila RNA Using Sequence Feature Embeddings

EDLm6APred: ensemble deep learning approach for mRNA m6A site prediction

A novel RNA pseudouridine site prediction model using Utility Kernel and data-driven parameters

DLm6Am: A Deep-Learning-Based Tool for Identifying N6,2'-O-Dimethyladenosine Sites in RNA Sequences

Identifying piRNA targets on mRNAs in C. elegans using a deep multi-head attention network

BERMP: a Cross-Species Classifier for Predicting m6A Sites by Integrating a Deep Learning Algorithm and a Random Forest Approach

DeepGenGrep: a general deep learning-based predictor for multiple genomic signals and regions