Novel Multikernel Trick for Predicting Pan-Cancer Distant Metastatic Sites Using a Feature Extraction Strategy

Yining Xu, Liyuan Zhang, Xinran Cui, Tianyi Zhao, Yadong Wang
DOI: https://doi.org/10.1109/bibm52615.2021.9669335
2021-01-01
Abstract: Distant metastasis is the leading cause of cancer death. Identifying the tendency of a given cancer to metastasize could aid cancer diagnosis and treatment planning. In cancer studies, mRNA gene expression data have been widely used to predict cancer metastasis because they are easy to obtain and because they reflect cancer progression directly and in detail. In these studies, feature extraction followed by a prediction model has been a common approach to predicting pan-cancer prognosis and tumor stage. Their limitations include a lack of comprehensive feature extraction, relatively low accuracy in predicting cancer outcomes and a lack of precise pan-cancer metastasis site prediction. To address these limitations, we designed a pipeline to characterize the heterogeneity of pan-cancer distant metastatic sites using mRNA gene expression data. We used a directed relational graph convolutional network (DR-GCN) for feature extraction and a multikernel support vector machine (SVM) for pan-cancer distant metastasis site prediction. DR-GCN uncovered hidden features from relational networks, extracting features from gene-cancer, cancer-disease and gene-gene relational networks, and was able to handle complex prior knowledge-based feature extraction tasks. A dynamic weight multikernel SVM was then applied to predict pan-cancer distant metastasis sites. With this method, the AUROC of the multikernel SVM (0.7542) exceeded that of each single-kernel SVM (polynomial kernel: 0.7346, RBF kernel: 0.72, linear kernel: 0.7255). Finally, we applied our pipeline to an extremely unbalanced small-sample dataset and, in predicting TCGA glioblastoma (GBM) patient prognosis, obtained a higher AUPRC (0.2606) than other semisupervised learning methods (Laplacian SVM: 0.15, TSVM: 0.21, SSL-EM: 0.15, RRLSL: 0.22).
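The kernel-combination step of a multikernel SVM can be illustrated with a minimal sketch. The abstract does not state how the dynamic weights are computed, so the sketch below assumes a simple stand-in: each kernel is weighted by its kernel-target alignment on the training labels. The toy data, the kernel hyperparameters (`degree`, `coef0`, `gamma`) and all function names are hypothetical and are not taken from the paper.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, degree=2, coef0=1.0):
    # Hyperparameters are illustrative assumptions.
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def gram(kernel, X):
    """Gram matrix K[i][j] = kernel(X[i], X[j])."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

def alignment(K, y):
    """Kernel-target alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F), labels in {-1, +1}."""
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    norm_k = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    return num / (norm_k * n)  # ||yy^T||_F equals n for +/-1 labels

def combine(grams, weights):
    """Weighted sum of Gram matrices; still a valid kernel when weights are >= 0."""
    n = len(grams[0])
    return [[sum(w * G[i][j] for w, G in zip(weights, grams)) for j in range(n)]
            for i in range(n)]

# Toy two-class data (hypothetical, for illustration only).
X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y = [-1, -1, 1, 1]

grams = [gram(k, X) for k in (linear_kernel, poly_kernel, rbf_kernel)]

# Assumed weighting rule: clip negative alignments to zero, then normalize.
scores = [max(alignment(G, y), 0.0) for G in grams]
total = sum(scores)
weights = [s / total for s in scores] if total > 0 else [1.0 / len(grams)] * len(grams)

K = combine(grams, weights)  # combined Gram matrix
```

The combined matrix `K` could then be handed to any SVM solver that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Because a non-negative weighted sum of valid kernels is itself a valid kernel, the weighting rule can be swapped for another scheme without changing the rest of the sketch.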