Abstract

Background: Hepatocellular carcinoma (HCC) is one of the leading causes of cancer death worldwide, owing to limitations in its prognosis. Current prognostic approaches include radiological examination and detection of serum biomarkers; however, both have limited efficiency and are ineffective for early prognosis. Given these limitations, we propose using RNA-Seq data to evaluate putative higher-accuracy biomarkers at the transcript level that could help in early prognosis.

Methods: To identify such potential transcript biomarkers, RNA-Seq data for healthy liver and various HCC cell models were subjected to five machine learning algorithms: random forest, K-nearest neighbor, Naïve Bayes, support vector machine, and neural networks. Several metrics, namely sensitivity, specificity, Matthews correlation coefficient (MCC), informedness, and AUC-ROC (except for support vector machine), were evaluated. The algorithms that produced the highest values for all metrics were chosen to extract the top features, which were then subjected to recursive feature elimination. Recursive feature elimination yielded the smallest set of features able to differentiate between the healthy and HCC cell models.

Results: The metrics show that the known protein biomarkers for HCC are less efficient than the complete transcriptomics data. Among the machine learning algorithms, random forest and support vector machine demonstrated the best performance. Applying recursive feature elimination to the top features of random forest and support vector machine selected three transcripts that had an accuracy of 0.97 and a kappa of 0.93. Of the three transcripts, two were protein coding (PARP2-202 and SPON2-203) and one was non-coding (CYREN-211). Lastly, we demonstrated that these three transcripts outperformed sets of three randomly chosen transcripts (15,000 combinations); hence they were not chance findings and could be interesting candidates for new HCC biomarker development.

Conclusion: Using RNA-Seq data combined with machine learning approaches can aid in finding novel transcript biomarkers. The three biomarkers identified, PARP2-202, SPON2-203, and CYREN-211, presented the highest accuracy among all transcripts in differentiating the healthy and HCC cell models. The machine learning pipeline developed in this study can be used for any RNA-Seq dataset to find novel transcript biomarkers.

Code: www.github.com/rajinder4489/ML_biomarkers
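For readers less familiar with the evaluation metrics named in the abstract, the sketch below shows how sensitivity, specificity, accuracy, informedness, MCC, and Cohen's kappa can be derived from a two-class confusion matrix. It is a minimal illustration, not the code from the linked repository: the function name `binary_metrics`, the choice of HCC as the positive class, and the example counts are all assumptions made here and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute two-class evaluation metrics from confusion-matrix counts.

    Assumption (hypothetical setup): the positive class is HCC and the
    negative class is healthy liver. Both classes must be present.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes need at least one sample")

    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)                   # true positive rate
    specificity = tn / (tn + fp)                   # true negative rate
    accuracy = (tp + tn) / total
    informedness = sensitivity + specificity - 1   # Youden's J statistic

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "informedness": informedness,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only (not data from the paper).
example = binary_metrics(tp=45, fn=5, fp=5, tn=45)
```

Unlike plain accuracy, informedness, MCC, and kappa all correct for agreement expected by chance, which is presumably why the abstract reports kappa alongside accuracy for the final three-transcript model.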

A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies

Comparative Study of Cancer Classification by Analysis of RNA-seq Gene Expression Levels

A comparison of deep learning-based pre-processing and clustering approaches for single-cell RNA sequencing data

The efficacy of various machine learning models for multi-class classification of RNA-seq expression data

Impact of RNA-seq data analysis algorithms on gene expression estimation and downstream prediction

Application of RNA Processing Factors for Predicting Clinical Outcomes in Colon Cancer

Analyzing RNA-Seq Gene Expression Data Using Deep Learning Approaches for Cancer Classification

Unifying cancer and normal RNA sequencing data from different sources

Assessing the Impact of Data Preprocessing on Analyzing Next Generation Sequencing Data

Machine Learning Analysis of RNA-seq Data for Diagnostic and Prognostic Prediction of Colon Cancer

Comparison of RNA-Seq and microarray in the prediction of protein expression and survival prediction

Benchmarking UMI-based single cell RNA-sequencing preprocessing workflows

Deep learning-based cancer survival prognosis from RNA-seq data: approaches and evaluations

Comparison of RNA-seq and microarray-based models for clinical endpoint prediction

Uncovering the roles of microRNAs/lncRNAs in characterising breast cancer subtypes and prognosis

Classifying Lung Adenocarcinoma and Squamous Cell Carcinoma Using RNA-Seq Data

Identifying novel transcript biomarkers for hepatocellular carcinoma (HCC) using RNA-Seq datasets and machine learning

Characterization of RNA Processing Genes in Colon Cancer for Predicting Clinical Outcomes

Variability in estimated gene expression among commonly used RNA-seq pipelines

Predictors of breast cancer cell types and their prognostic power in breast cancer patients

Identifying and Analyzing Different Cancer Subtypes Using RNA-seq Data of Blood Platelets