Machine learning integrated ensemble of feature selection methods followed by survival analysis for predicting breast cancer subtype specific miRNA biomarkers

Jnanendra Prasad Sarkar, Indrajit Saha, Anasua Sarkar, Ujjwal Maulik
DOI: https://doi.org/10.1016/j.compbiomed.2021.104244
Abstract: Breast cancer is the second leading cancer type among females. MicroRNAs (miRNAs) play an important role in this disease by regulating gene expression at the post-transcriptional level. However, identifying the most influential miRNAs in breast cancer subtypes remains a challenging task, even though recent advances in Next Generation Sequencing (NGS) techniques allow high-throughput miRNA expression data to be analyzed. We therefore use NGS data of breast cancer to identify the most significant miRNA biomarkers, which are highly associated with multiple breast cancer subtypes. For this purpose, we propose a two-phase technique called Machine Learning Integrated Ensemble of Feature Selection Methods, followed by survival analysis. In the first phase, we select the best of seven machine learning techniques based on classification accuracy using the entire set of features (in this case, miRNAs). Eight different feature selection methods are then applied separately to rank the features, and each set of top features is validated with the selected machine learning technique on a multi-class classification task over the breast cancer subtypes. In the second phase, based on the classification accuracy values, the top features from each feature selection method are combined into an ensemble that further categorizes the miRNAs as 8*, 7*, and so on down to 1*. The 8* miRNAs provide the highest average classification accuracy of 86% after 10-fold cross-validation. Thereafter, 27 miRNAs are identified from the 8* to 4* miRNAs based on their importance for survival in breast cancer subtypes, using Cox regression based survival analysis.
Moreover, expression analysis, regulatory network analysis, protein-protein interaction analysis, and KEGG pathway and gene ontology enrichment analysis are performed to validate the biological significance of the proposed solution. Additionally, we have prepared a miRNA-protein-drug interaction network to identify possible drugs for the selected miRNAs. Our findings may thus be considered in clinical trials for the treatment of breast cancer patients.
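The ensemble step in the second phase can be illustrated with a minimal sketch. The abstract does not spell out the exact aggregation rule, so this assumes that a miRNA's star rating is the number of feature selection methods whose top-feature set contains it (8* meaning selected by all eight methods). The function names, method labels, and miRNA names are hypothetical placeholders, and the toy input uses three methods instead of eight.

```python
from collections import Counter


def star_ratings(selected_by_method):
    """Count, for each feature, how many methods selected it.

    Assumption: the star rating equals the number of feature selection
    methods whose top-feature set contains the feature.
    """
    counts = Counter()
    for features in selected_by_method.values():
        # Use a set so a method votes at most once per feature.
        counts.update(set(features))
    return dict(counts)


def group_by_stars(ratings):
    """Group features by star rating, highest rating first."""
    groups = {}
    for feature, stars in ratings.items():
        groups.setdefault(stars, []).append(feature)
    return {
        stars: sorted(names)
        for stars, names in sorted(groups.items(), reverse=True)
    }


# Hypothetical toy input: top features chosen by three placeholder methods.
selected = {
    "method_a": ["mirna_A", "mirna_B", "mirna_C"],
    "method_b": ["mirna_A", "mirna_C", "mirna_D"],
    "method_c": ["mirna_A", "mirna_B", "mirna_E"],
}

ratings = star_ratings(selected)
groups = group_by_stars(ratings)

for stars, names in groups.items():
    print(f"{stars}*: {', '.join(names)}")
```

In this toy setting, a feature chosen by every method lands in the highest star group, which mirrors how the 8* miRNAs would be the ones common to all eight methods under the stated assumption.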