Abstract: Text processing tasks commonly face the challenge of high dimensionality. One of the most effective remedies is to preprocess the text data with a feature selection method, which picks from the native feature space of the text the features most useful for subsequent operations (e.g., classification). This reduces the dimensionality of the feature space and improves the efficiency and accuracy of those operations. This paper proposes a simple and efficient filter feature selection method for text processing based on document-term matrix unitization (DTMU). Unlike previous filter feature selection methods, which concentrate on defining the scoring criterion, our method improves feature selection by unitizing each column of the document-term matrix. This mitigates feature-to-feature influence and reinforces the role of the weighting proportion within the features. The scoring criterion then subtracts the sum of weights over negative samples from the sum of weights over positive samples and takes the absolute value. We conduct numerical experiments comparing DTMU with four advanced filter feature selection methods (max–min ratio metric, proportional rough feature selector, least loss, and relative discrimination criterion) and two classical ones (Chi-square and information gain). The experiments use four datasets whose feature spaces have on the order of ten thousand dimensions (book, dvd, music, movie) and two whose feature spaces have on the order of a thousand dimensions (imdb, amazon_cells), sourced from Amazon product reviews and movie reviews. The results show that DTMU selects features that are more useful for subsequent operations and achieves a higher dimensionality reduction rate than the six comparison methods. Moreover, DTMU generalizes robustly across classifiers and across datasets of different dimensionality. Notably, the average CPU time for a single run of DTMU is 1.455 s.
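The scoring procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which norm "unitization" uses, so the Euclidean norm is assumed here, and the function names `dtmu_scores` and `select_top_k`, the `norm_ord` parameter, and the toy matrix are all hypothetical.

```python
import numpy as np


def dtmu_scores(X, y, norm_ord=2):
    """Score each feature (column) of a document-term matrix.

    X: (n_documents, n_terms) array of term weights.
    y: (n_documents,) array of binary labels, 1 = positive, 0 = negative.
    norm_ord: norm used to unitize each column (assumption: Euclidean).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Unitize each column so that every feature has unit norm.
    norms = np.linalg.norm(X, ord=norm_ord, axis=0)
    norms[norms == 0] = 1.0  # leave all-zero columns unchanged
    U = X / norms

    # Sum of unitized weights over positive and over negative samples.
    pos_sum = U[y == 1].sum(axis=0)
    neg_sum = U[y == 0].sum(axis=0)

    # Absolute difference between the two sums is the feature score.
    return np.abs(pos_sum - neg_sum)


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    return np.argsort(-scores, kind="stable")[:k]


# Toy example: 4 documents, 3 terms; the first two documents are positive.
X = np.array([[2, 1, 0],
              [2, 0, 0],
              [0, 1, 3],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

scores = dtmu_scores(X, y)
selected = select_top_k(scores, 2)
print(scores, selected)
```

In this toy matrix the first term occurs only in positive documents and the third only in a negative one, so both outrank the second term, which is spread across both classes.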

A simple and efficient filter feature selection method via document-term matrix unitization
