Identifying Cancer Biomarkers from High-Throughput RNA Sequencing Data by Machine Learning

Zishuang Zhang, Zhi-Ping Liu
DOI: https://doi.org/10.1007/978-3-030-26969-2_49
2019-01-01
Abstract: During cancer progression, the expression levels of relevant genes change significantly in tumors compared with their healthy counterparts. Discovering specific genes that serve as biomarkers is therefore of practical significance for diagnosis and prognosis. High-throughput ‘-omic’ datasets, such as the public RNA-sequencing data generated by The Cancer Genome Atlas (TCGA) consortium, provide unprecedented resources and opportunities for deriving cancer biomarkers. Here, we use machine learning to identify biomarker genes in 12 types of cancer, based on how well they classify control and disease samples. We first identify differentially expressed genes individually. We then perform feature selection by integrating recursive feature reduction and random forest classification with feature ranking. The final number of features is determined by a parsimony principle: use as few features as possible while still achieving the highest classification accuracy. For each cancer, the biomarker genes are then evaluated by tenfold cross-validation using several classification algorithms. We find that the extreme learning machine achieves the best classification performance among the methods compared. Further gene enrichment analyses indicate the dysfunctional and pathogenic mechanisms associated with the identified biomarkers.
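The feature-selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `tolerance` parameter, the gene names, the importance scores, and the toy accuracy function are all hypothetical stand-ins. In the paper, the ranking would come from a random forest and the accuracy from a trained classifier; here both are replaced by simple callables so that the elimination loop and the parsimony rule stay visible.

```python
from typing import Callable, Dict, List, Sequence, Tuple

History = Dict[int, Tuple[float, List[str]]]


def recursive_elimination(
    features: Sequence[str],
    rank_features: Callable[[List[str]], List[str]],
    evaluate: Callable[[List[str]], float],
    step: int = 1,
) -> History:
    """Repeatedly rank the remaining features, record the accuracy of the
    current subset, and drop the `step` lowest-ranked features."""
    current = list(features)
    history: History = {}
    while current:
        ranked = rank_features(current)  # most important first
        history[len(ranked)] = (evaluate(ranked), ranked)
        # Stop once fewer than `step` + 1 features would remain.
        current = ranked[:-step] if len(ranked) > step else []
    return history


def most_parsimonious(history: History, tolerance: float = 0.0) -> Tuple[int, List[str]]:
    """Parsimony rule: the smallest subset whose accuracy is within
    `tolerance` of the best accuracy seen."""
    best = max(acc for acc, _ in history.values())
    smallest = min(n for n, (acc, _) in history.items() if acc >= best - tolerance)
    return smallest, history[smallest][1]


# --- Toy stand-ins (hypothetical, for illustration only) ---
IMPORTANCE = {"GENE_A": 0.5, "GENE_B": 0.3, "GENE_C": 0.15, "GENE_D": 0.05}
INFORMATIVE = {"GENE_A", "GENE_B"}


def toy_rank(subset: List[str]) -> List[str]:
    # Stand-in for a random forest importance ranking.
    return sorted(subset, key=lambda g: IMPORTANCE[g], reverse=True)


def toy_accuracy(subset: List[str]) -> float:
    # Stand-in for cross-validated accuracy: only informative genes help.
    return 0.5 + 0.25 * len(INFORMATIVE.intersection(subset))


history = recursive_elimination(list(IMPORTANCE), toy_rank, toy_accuracy)
count, genes = most_parsimonious(history)
print(count, genes)
```

In this toy setting, subsets of four, three, and two genes all reach the top accuracy, so the rule keeps the two-gene subset. A nonzero `tolerance` would trade a small loss in accuracy for an even smaller panel, which is a design choice the abstract does not specify.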