Abstract: In the era of big data, dimensionality reduction plays an extremely important role in many fields driven by machine learning and data mining techniques. Existing information-theoretic feature selection algorithms generally reduce dimensionality by selecting the features with maximum class-relevance and minimum redundancy, but they largely overlook the complementary correlation among features and sometimes handle it improperly. This paper proposes a novel feature subset selection algorithm called Clustering-based Feature Selection with Redundancy-Complementarity Analysis (CFSRCA). The proposed algorithm consists of two main steps: (a) selecting the candidate class-relevant features, and (b) selecting the representative features. In the latter step, the representative features are defined as the features with minimum redundancy and maximum complementarity, and a clustering method based on the minimum spanning tree (MST) is proposed to identify them effectively. To validate the effectiveness of CFSRCA, it is compared against three feature selection algorithms (ReliefF, CFS, and FOU) using four well-known classifiers (C4.5, SVM, kNN, and NBC) in classification experiments on eight datasets. Experimental results verify the effectiveness of the proposed feature subset selection algorithm.
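
The two-step structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact relevance measure, the redundancy-complementarity measure, or the MST cutting rule, so everything below is an assumption made for illustration. Relevance is taken to be the mutual information I(X;C), the pairwise distance is taken to be I(Xi;Xj|C) - I(Xi;Xj) (negative when two features are mostly redundant, positive when they are mostly complementary), and the function name `select_features` and the thresholds `relevance_min` and `cut` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*cols):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    n = len(cols[0])
    counts = Counter(zip(*cols))
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def select_features(features, labels, relevance_min=0.05, cut=0.0):
    """features: dict mapping feature name -> list of discrete values."""
    # Step (a): keep class-relevant candidates (assumed criterion: I(X;C)).
    relevance = {n: mutual_info(col, labels) for n, col in features.items()}
    cand = [n for n in features if relevance[n] > relevance_min]
    if len(cand) <= 1:
        return cand

    # Assumed distance: complementarity minus redundancy.
    def dist(a, b):
        fa, fb = features[a], features[b]
        return cond_mutual_info(fa, fb, labels) - mutual_info(fa, fb)

    # Step (b): minimum spanning tree over the candidates (Prim's algorithm).
    in_tree, edges = {cand[0]}, []
    while len(in_tree) < len(cand):
        d, u, v = min((dist(u, v), u, v)
                      for u in in_tree for v in cand if v not in in_tree)
        edges.append((d, u, v))
        in_tree.add(v)

    # Cut MST edges longer than `cut`; the remaining edges join redundant
    # features into clusters (tracked with a simple union-find).
    parent = {n: n for n in cand}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for d, u, v in edges:
        if d <= cut:
            parent[find(v)] = find(u)

    clusters = {}
    for n in cand:
        clusters.setdefault(find(n), []).append(n)

    # Assumed representative: the most class-relevant feature in each cluster.
    return [max(members, key=relevance.get) for members in clusters.values()]
```

Under these assumptions, an exact duplicate of a feature falls into the same cluster as the original and only one of the two is kept, while two features that are informative only in combination stay in separate clusters and are both kept. Note that a feature that is irrelevant on its own but useful jointly with another (for example, one input of an XOR) would already be dropped in step (a) of this sketch.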

A feature selection algorithm for document clustering based on word co-occurrence frequency

CLDA: Feature Selection for Text Categorization Based on Constrained LDA

Document Clustering Using Locality Preserving Indexing

U^2F^2S^2: Uncovering Feature-level Similarities for Unsupervised Feature Selection

Improving Short Text Classification Through Better Feature Space Selection

Multitype Features Coselection for Web Document Clustering

An Evaluation on Feature Selection for Text Clustering

Relative Term-Frequency Based Feature Selection for Text Categorization

Heuristic feature selection method for clustering

A New Unsupervised Feature Selection Algorithm Using Similarity-Based Feature Clustering

A Clustering Algorithm for Short Documents Based On Concept Similarity

Feature clustering method based on distribution distance

Feature Selection Based on Data Clustering

Sparse Poisson coding for high dimensional document clustering

Cross-Lingual Document Clustering Based on Similarity Space Model

A Feature Selection Framework Based on Supervised Data Clustering

An Effective Feature Selection Method For Text Categorization

A Feature Selection Method Based on Feature Grouping and Genetic Algorithm

Clustering-based feature subset selection with analysis on the redundancy–complementarity dimension

A comprehensive unsupervised feature selection method of two-stage strategy

CWC: A Clustering-Based Feature Weighting Approach for Text Classification