Several alternative term weighting methods for text representation and classification

Zhong Tang, Wenqiang Li, Yan Li, Wu Zhao, Song Li
DOI: https://doi.org/10.1016/j.knosys.2020.106399
2020-11-01
Abstract:<p>Text representation is an active research topic that underpins text classification (TC) tasks, and it has a substantial impact on TC performance. Although the well-known TF-IDF was designed for information retrieval rather than for TC, it is highly useful in TC as a term weighting method for representing text content. Inspired by the IDF part of TF-IDF, which is defined as a logarithmic transformation, we propose several alternative methods in this study to generate unsupervised term weighting schemes that can offset the drawback of TF-IDF. Moreover, because TC tasks differ from information retrieval, representing test texts as vectors in an appropriate way is also essential for TC, especially for supervised term weighting approaches (e.g., TF-RF), mainly because these methods need category information when weighting terms. However, most current schemes do not clearly explain how test texts should be represented. To explore this problem and seek a reasonable solution, we analyze three typical supervised term weighting methods in depth to illustrate how to represent the test text. To investigate the effectiveness of our work, we design three sets of experiments to compare the performance of the methods. The comparisons show that our proposed methods can indeed enhance TC performance and sometimes even outperform existing supervised term weighting methods.</p>
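For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of classic TF-IDF weighting, in which the IDF factor is the logarithm of the ratio of the collection size to a term's document frequency. It illustrates only the standard scheme, not the alternative methods proposed in the paper. The function name `tf_idf`, the raw-count TF variant, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight terms with classic TF-IDF.

    docs: list of token lists (hypothetical toy input).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        # IDF is the logarithmic transformation log(N / df).
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["good", "movie"], ["bad", "movie"], ["good", "plot", "good"]]
vectors = tf_idf(docs)
```

Note that a term occurring in every document receives a weight of zero under this formulation, regardless of how it is distributed across categories, since the scheme uses no category information.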
computer science, artificial intelligence