Abstract: In many data analysis tasks, one is confronted with very high-dimensional data. Feature selection is essentially a combinatorial optimization problem and is therefore computationally expensive. To make it tractable, traditional feature selection methods frequently assume either that the features influence the class variable independently or that they do so only through pairwise feature interactions. They also attempt to select a single common feature subset for all the clusters present in the data, and in doing so neglect the fact that different features may have different discriminating power for different classes. To tackle these problems, we propose a localized graph-based feature selection algorithm consisting of three steps: i) using the label information, we construct a graph for each class of the dataset, in which each node corresponds to a feature and each edge has a weight corresponding to the mutual information (MI) between the two features it connects; ii) we perform dominant set clustering on these graphs to select a highly coherent set of features; iii) we further refine the selected features using a new measure called multidimensional interaction information (MII). The advantage of MII is that it goes beyond pairwise interaction and can consider third- or higher-order feature interactions. Using dominant set clustering as a preprocessing step, we extract the most informative features in the leading dominant set and thereby limit the search space for higher-order interactions. We then use a variational EM (VBEM) algorithm to learn a Gaussian mixture model on the selected feature subset for clustering. Experimental results on a number of standard datasets demonstrate the effectiveness of our localized feature selection method.
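Steps i) and ii) can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a simple histogram estimate of MI and extracts the leading dominant set with standard replicator dynamics. The function names, the bin count, and the support cutoff are all hypothetical choices made for illustration. The MII refinement of step iii) and the VBEM clustering stage are not shown, since the abstract does not define them.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of MI (in nats) between two 1-D samples (assumed estimator)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def feature_graph(X, bins=8):
    """Step i): symmetric feature-by-feature MI weight matrix with zero diagonal."""
    d = X.shape[1]
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            W[i, j] = W[j, i] = mutual_information(X[:, i], X[:, j], bins)
    return W

def dominant_set(W, iters=1000, tol=1e-8, cutoff=1e-4):
    """Step ii): leading dominant set via replicator dynamics x <- x * (Wx) / (x'Wx)."""
    n = W.shape[0]
    x = np.full(n, 1.0 / n)               # start at the barycenter of the simplex
    for _ in range(iters):
        Wx = W @ x
        denom = x @ Wx
        if denom <= 0:                    # degenerate graph with no positive weights
            break
        x_new = x * Wx / denom
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break
    return np.flatnonzero(x > cutoff)     # features with non-negligible weight

def localized_feature_sets(X, y, bins=8):
    """Build one feature graph per class label and return its leading dominant set."""
    return {c: dominant_set(feature_graph(X[y == c], bins)) for c in np.unique(y)}
```

Calling `localized_feature_sets(X, y)` on a samples-by-features array `X` with a label vector `y` returns one array of feature indices per class, so different classes may end up with different feature subsets, which is the sense in which the selection is localized.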

Model-based Clustering of High-Dimensional Data: Variable Selection Versus Facet Determination

Clustering and Prediction with Variable Dimension Covariates

Bayesian Clustering with Variable and Transformation Selections

Flexible Variable Selection for Clustering and Classification

Model-based multifacet clustering with high-dimensional omics applications

Localized graph-based feature selection for clustering

Feature Selection for Clustering on High Dimensional Data

Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering

Factor Adjusted Spectral Clustering for Mixture Models

A nonparametric variable clustering model

A Two-Stage Variable Selection Approach for Correlated High Dimensional Predictors

A sparse factor model for clustering high‐dimensional longitudinal data

Variable selection in model-based clustering and discriminant analysis with a regularization approach

A new model for natural groupings in high-dimensional data

MCEN: a Method of Simultaneous Variable Selection and Clustering for High-Dimensional Multinomial Regression

Simultaneous Bayesian Clustering and Model Selection with Mixture of Robust Factor Analyzers

Bayesian approaches to variable selection in mixture models with application to disease clustering

Discriminative variable selection for clustering with the sparse Fisher-EM algorithm

Flexible Clustering by Tendency in High Dimensional Space

High-dimensional variable selection accounting for heterogeneity in regression coefficients across multiple data sources

Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions