Abstract
Background The small number of samples and the curse of dimensionality hamper the application of deep learning techniques to disease classification. Additionally, the performance of clustering-based feature selection algorithms remains unsatisfactory because they are limited to unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets, and some current feature selection methods suffer from low sensitivity and specificity in this field.
Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS, which simultaneously performs feature selection and model learning for genomic data analysis. Comparisons with seven benchmark and six state-of-the-art supervised methods on eight data sets demonstrated that MCBFS is robust and effective. The visualization results and the statistical test showed that MCBFS can capture informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW, which uses gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for the diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm, and a feature selection wrapper. McbfsNW was applied to lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that the identified biomarkers attain higher prediction performance on the independent LUAD data set, and we also constructed a drug-target network that may be useful for LUAD therapy.
Conclusions The proposed feature selection method is robust and effective for gene selection, classification, and visualization. The McbfsNW framework is practical and helpful for identifying biomarkers and targets from genomic data. We believe that the same methods and principles can be extended to other kinds of data sets.

Exploring automated Feature Selection for Model-based and Density-based clustering with application to NCI 60 data

Bayesian Clustering with Variable and Transformation Selections

Feature Selection with Attributes Clustering by Maximal Information Coefficient

Ensemble feature selection with clustering for analysis of high-dimensional, correlated clinical data in the search for Alzheimer's disease biomarkers

Supervised clustering of high-dimensional data using regularized mixture modeling

A Novel Approach for Single Gene Selection Using Clustering and Dimensionality Reduction

A Supervised Feature Selection Method For Mixed-Type Data using Density-based Feature Clustering

Simultaneous Estimation of Number of Clusters and Feature Sparsity in Clustering High-Dimensional Data

A Feature Selection Framework Based on Supervised Data Clustering

Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data

A New Unsupervised Feature Selection Algorithm Using Similarity-Based Feature Clustering

Unsupervised feature selection for multi-cluster data

A Novel Unsupervised Feature Selection Approach Using Genetic Algorithm on Partitioned Data

Limitations of Clustering with PCA and Correlated Noise

Feature Selection Based on Data Clustering

Clustering-Guided Sparse Structural Learning for Unsupervised Feature Selection

Exploring structural components in autoencoder-based data clustering

Clustering-based feature subset selection with analysis on the redundancy–complementarity dimension

Automatic Parameter Selection for Non-Redundant Clustering

Supervised Convex Clustering

Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm