Using modified term frequency to improve term weighting for text classification
Long Chen, Liangxiao Jiang, Chaoqun Li
DOI: https://doi.org/10.1016/j.engappai.2021.104215
IF: 8
2021-05-01
Engineering Applications of Artificial Intelligence
Abstract: Text classification (TC) is an essential task in natural language processing (NLP). To improve TC performance, term weighting is often used to obtain an effective text representation by assigning an appropriate weight to each term. A term weighting scheme generally consists of a term frequency factor, a collection frequency factor, and a normalization factor. The normalization factor is an optional factor commonly used to offset the influence of document length. Our investigation of existing term weighting schemes shows that most focus on finding a more effective collection frequency factor and rarely pay attention to finding a new term frequency factor. In this paper, we first propose a new term frequency factor called modified term frequency (MTF). Unlike the normalization factor, MTF directly modifies the raw term frequency based on the length information of all training documents. We then propose a new term weighting scheme that combines MTF with an existing collection frequency factor called modified distinguishing feature selector (MDFS). We denote this scheme MTF-MDFS (MDFS-based MTF). Extensive experimental results on 19 benchmark text datasets and 6 real-world text datasets show that the proposed MTF and MTF-MDFS both perform much better than their state-of-the-art competitors in terms of classification accuracy and the weighted average F1 of widely used base classifiers, such as MNB, SVM and LR.
automation & control systems; computer science, artificial intelligence; engineering, electrical & electronic; engineering, multidisciplinary
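
The three-factor structure described in the abstract (term frequency factor, collection frequency factor, optional normalization factor) can be illustrated with a minimal sketch. The abstract does not give the MTF or MDFS formulas, so this example does not implement them. As an assumption for illustration only, it uses raw counts as the term frequency factor, the standard IDF log(N / df) as the collection frequency factor, and cosine normalization. The function name tfidf_weights is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Weight terms with a generic three-factor scheme (not MTF-MDFS).

    docs: list of documents, each a list of tokens.
    Returns one dict per document mapping term -> weight.
    """
    n_docs = len(docs)

    # Collection frequency factor (assumed here: standard IDF, log(N / df)).
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    weighted = []
    for tokens in docs:
        # Term frequency factor (assumed here: raw counts).
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}

        # Normalization factor (optional): cosine normalization offsets
        # the influence of document length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
        weighted.append(weights)
    return weighted


if __name__ == "__main__":
    corpus = [["a", "b", "a"], ["b", "c"]]
    for doc_weights in tfidf_weights(corpus):
        print(doc_weights)
```

In this sketch, a scheme such as the one the abstract proposes would replace the raw count with a length-aware term frequency factor and the IDF with a different collection frequency factor; the overall product structure would stay the same.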