Chi-square Statistics Feature Selection Based on Term Frequency and Distribution for Text Categorization

Chuanxin Jin, Tinghuai Ma, Rongtao Hou, Meili Tang, Yuan Tian, Abdullah Al-Dhelaan, Mznah Al-Rodhaan
DOI: https://doi.org/10.1080/03772063.2015.1021385
IF: 1.8768
2015-01-01
IETE Journal of Research
Abstract: Text categorization (TC) has become a key technology for finding relevant and timely information in large volumes of digital documents, and feature selection techniques have been proposed to overcome the high dimensionality that causes high computational complexity and low accuracy in TC tasks. Chi-square statistics (CHI) is one of the most efficient feature selection methods; however, it has two weaknesses. (1) It is based on document frequency and counts only whether a term occurs in a document, not how often. In practice, a high-frequency term that occurs in only a few documents is often regarded as a discriminator in the corpus. (2) It does not consider the term distribution. A term has more discriminating power for a specific category when the variation in its distribution is lower. In this paper, we propose a modified CHI feature selection approach, called term frequency and distribution based CHI, to overcome these weaknesses. We use sample variance to measure the term distribution, and improve the classic CHI with maximum term frequency. Extensive comparative experiments on three corpora show that the proposed approach is comparable to the classic feature selection methods in terms of macro-F1 and micro-F1.
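To make weakness (1) concrete, the following is a minimal sketch of the classic CHI score only, not of the modified approach proposed in the paper, whose exact formula the abstract does not give. It uses the standard 2x2 contingency form of the chi-square statistic; the function names `contingency_counts` and `chi_square` and the toy corpus are assumptions made for illustration.

```python
def contingency_counts(docs, labels, term, category):
    """Count documents by term presence and category membership.

    Returns (a, b, c, d):
      a: documents in the category that contain the term
      b: documents outside the category that contain the term
      c: documents in the category that lack the term
      d: documents outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens      # presence only; repeats are ignored
        in_category = label == category
        if present and in_category:
            a += 1
        elif present:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Classic chi-square score for one (term, category) pair."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category covers all or no documents
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus: tokenized documents with one label each.
docs = [["goal", "goal", "match"], ["goal"], ["vote"], ["vote", "poll"]]
labels = ["sport", "sport", "politics", "politics"]

score = chi_square(*contingency_counts(docs, labels, "goal", "sport"))
```

In this sketch the first document contains "goal" twice, yet it contributes exactly as much as the second document, which contains it once. This is the document-frequency behaviour that the abstract identifies as a weakness.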