Prognostically Relevant Subtypes and Survival Prediction for Breast Cancer Based on Multimodal Genomics Data

Md. Rezaul Karim, Galih Wicaksono, Ivan G. Costa, Stefan Decker, Oya Beyan
DOI: https://doi.org/10.1109/access.2019.2941796
IF: 3.9
2019-01-01
IEEE Access
Abstract: Cancer is one of the deadliest diseases, caused by abnormal behavior of genes that control cell division and growth. Genomics data and clinical outcomes from multiplatform and heterogeneous sources are used to make clinical decisions for cancer patients, where both multimodality and heterogeneity pose significant challenges to bioinformatics tools and algorithms. Numerous approaches have been proposed to overcome these challenges, using sophisticated bioinformatics and machine learning algorithms as either primary or supporting tools. In this paper, we propose a new approach to analyzing genomics data from The Cancer Genome Atlas (TCGA) to classify breast cancer patients by subtype and survival rate. Since multiple factors such as estrogen receptor (ER), progesterone receptor (PGR), and human epidermal growth factor receptor 2 (HER2) statuses are involved in breast cancer diagnosis, we used DNA methylation, gene expression (GE), and miRNA expression data, creating a multiplatform network called the Multimodal Autoencoders (MAE) classifier to support each data type. Experimental results demonstrate that our approach is promising, predicting both breast cancer subtypes and survival rates with high confidence. In particular, we achieved state-of-the-art accuracies of 91% and 86% for ER- and PGR-based subtype prediction, respectively, and moderately low accuracy for HER2-based subtype prediction; for survival prediction, we observed reasonably low MSE and positive coefficient of determination (R²) scores. Additionally, we created unimodal and multimodal features from each input type and trained decision tree (DT), Naive Bayes (NB), K-nearest neighbors (KNN), logistic regression (LR), support vector machine (SVM), random forest (RF), and gradient boosting trees (GBT) models as ML baselines. Finally, we use a model averaging ensemble of the top-3 models to report the final prediction.
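The final step of the abstract, a model averaging ensemble of the top-3 models, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that models are ranked by a validation score and that averaging means taking the mean of predicted class probabilities; the abstract specifies neither. The function name `average_top_k` and all numbers are hypothetical.

```python
import numpy as np

def average_top_k(probas, val_scores, k=3):
    """Average the class probabilities of the k best-scoring models.

    probas:     dict of model name -> array of shape (n_samples, n_classes)
    val_scores: dict of model name -> validation score (higher is better)
    Both the ranking criterion and probability averaging are assumptions.
    """
    # Rank model names by validation score, best first, and keep the top k
    top = sorted(val_scores, key=val_scores.get, reverse=True)[:k]
    # Element-wise mean of the selected models' probability matrices
    mean_proba = np.mean([probas[name] for name in top], axis=0)
    # Predicted label is the class with the highest averaged probability
    return top, mean_proba, mean_proba.argmax(axis=1)

# Toy probabilities for two patients and two classes (made up for illustration)
probas = {
    "LR":  np.array([[0.9, 0.1], [0.4, 0.6]]),
    "SVM": np.array([[0.8, 0.2], [0.3, 0.7]]),
    "RF":  np.array([[0.6, 0.4], [0.6, 0.4]]),
    "NB":  np.array([[0.1, 0.9], [0.9, 0.1]]),
}
val_scores = {"LR": 0.88, "SVM": 0.90, "RF": 0.85, "NB": 0.70}

top, mean_proba, labels = average_top_k(probas, val_scores)
print(top, labels)
```

In this toy case the weakest model (NB) is excluded, so its contrary predictions do not affect the averaged output.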
computer science, information systems; telecommunications; engineering, electrical & electronic