De-redundancy Relative Discrimination Criterion-based Feature Selection for Text Data

Lingbin Jin, Li Zhang
DOI: https://doi.org/10.1109/ijcnn55064.2022.9892781
2022-01-01
Abstract: The high dimensionality of text data degrades text classification performance because many of the terms are irrelevant. Feature selection is therefore needed to remove irrelevant terms. The Relative Discrimination Criterion (RDC) is an effective feature selection method that identifies irrelevant terms from discriminant information, but it cannot capture redundancy between terms. This paper therefore proposes a novel RDC-based text feature selection method, named De-redundancy Relative Discrimination Criterion (DRDC), which takes the redundancy between terms into account when assessing their importance. On the one hand, DRDC uses RDC to measure the relevance of terms to categories. On the other hand, DRDC uses mutual information to measure the redundancy between terms. During the iterations, the RDC and mutual information scores are normalized separately to balance them and to reduce the impact of marginal probabilities on mutual information. During feature selection, DRDC iteratively picks the term that has the maximum relevance to categories and the minimum redundancy with the terms already in the current feature subset. In this way, DRDC can find an optimal term subset. Experiments on the R8 and 20Newsgroups data sets with three widely used classifiers demonstrate the effectiveness and efficiency of DRDC.
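The greedy procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the RDC formula or the exact way the two scores are combined, so the sketch assumes that per-term relevance scores are supplied as plain numbers (a stand-in for RDC), that redundancy is the mean mutual information between binary term-presence vectors, that both scores are min-max normalized over the remaining candidates at each iteration, and that a candidate is ranked by normalized relevance minus normalized redundancy. The names `mutual_information`, `min_max`, and `greedy_select` are hypothetical.

```python
import math


def mutual_information(x, y):
    """Mutual information (in nats) between two binary presence vectors."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b) / n
            if p_ab > 0:
                p_a = sum(1 for xi in x if xi == a) / n
                p_b = sum(1 for yi in y if yi == b) / n
                mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def greedy_select(relevance, term_columns, k):
    """Greedily pick k term indices.

    relevance: per-term relevance scores (assumed stand-in for RDC).
    term_columns: one binary presence vector per term (1 = term in document).
    The combination rule (relevance minus redundancy) is an assumption.
    """
    n_terms = len(relevance)
    selected = []
    while len(selected) < min(k, n_terms):
        candidates = [j for j in range(n_terms) if j not in selected]
        if selected:
            # Mean mutual information with the terms already selected.
            red = [
                sum(mutual_information(term_columns[j], term_columns[s])
                    for s in selected) / len(selected)
                for j in candidates
            ]
        else:
            red = [0.0] * len(candidates)  # nothing selected yet
        # Normalize the two scores separately before combining them.
        rel_n = min_max([relevance[j] for j in candidates])
        red_n = min_max(red)
        best = max(range(len(candidates)), key=lambda i: rel_n[i] - red_n[i])
        selected.append(candidates[best])
    return selected


# Toy data: term 1 duplicates term 0, so it is fully redundant with it.
term_columns = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
relevance = [0.9, 0.85, 0.6, 0.1]
print(greedy_select(relevance, term_columns, 2))  # [0, 2]
```

In this toy example, term 1 has the second-highest relevance but is skipped because it carries the same information as term 0, which is the behaviour that a relevance-only ranking would miss.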