Identifying Diagnostic Biomarkers of Breast Cancer Based on Gene Expression Data and Ensemble Feature Selection
Lingyu Li,Yousif A. Algabri,Zhi-Ping Liu
DOI: https://doi.org/10.2174/1574893618666230111153243
2023-03-29
Current Bioinformatics
Abstract:Background: In recent years, the identification of biomarkers or signatures based on gene expression profiling data has attracted much attention in bioinformatics. The successful discovery of breast cancer (BRCA) biomarkers will be beneficial in reducing the risk of BRCA among patients for early detection. Methods: This paper proposes an Ensemble Feature Selection method to screen biomarkers (abbreviated as EFSmarker) for BRCA from publically available gene expression data. Firstly, we employ twelve filter feature selection methods, namely median, variance, Chi-square, Relief, Pearson and Spearman correlation, mutual information, minimal-redundancy-maximal-relevance criterion, ridge regression, decision tree and random forest with Gini index and accuracy index, to calculate the importance (weights or coefficients) of all features on the training dataset. Secondly, we apply the logistic regression classifier on the test dataset to calculate the classification AUC value of each feature subset individually selected by twelve methods. Thirdly, we provide an ensemble feature selection method by aggregating feature importance with classification AUC value. In particular, we establish a feature importance score (FIS) to evaluate the importance of each feature underlying all feature selection methods. Finally, the features with higher FIS are taken as identified biomarkers Results: With the direction of the FIS index induced by the EFSmarker method, 12 genes (COL10A1, COL11A1, MMP11, LOC728264, FIGF, GJB2, INHBA, CD300LG, IGFBP6, PAMR1, CXCL2 and FXYD1) are regarded as diagnostic biomarkers for BRCA. Especially, COL10A1, ranked first with a FIS value of 0.663, is identified as the most credible biomarker. The findings justified via gene and protein expression validation, functional enrichment analysis, literature checking and independent dataset validation verify the effectiveness and efficiency of these selected biomarkers. Conclusion: Our proposed biomarker discovery strategy not only utilizes the feature contribution but also considers the prediction accuracy simultaneously, which may also serve as a model for identifying unknown biomarkers for other diseases from high-throughput gene expression data. The source code and data are available at https://github.com/zpliulab/EFSmarker.
biochemical research methods,mathematical & computational biology