An improvement to TF-IDF: Term Distribution based Term Weight Algorithm.

Tian Xia, Yanmei Chai
DOI: https://doi.org/10.4304/jsw.6.3.413-420
2011-01-01
Abstract: In the process of document formalization, the term weight algorithm plays an important role. It strongly affects the precision and recall of natural language processing (NLP) systems. Currently, the TF-IDF term weight algorithm is widely applied in language models to build NLP systems. Because term frequency is not the only discriminator that must be considered for each weight to indicate a term's importance, we investigated other statistical characteristics of terms and found an important discriminator: term distribution. Furthermore, we found that, in a single document, a term with higher frequency and a distribution close to hypo-dispersion usually carries much semantic information and should be given a higher weight. On the other hand, in a document collection, a term with higher frequency and a hypo-dispersion distribution usually carries less information. Based on this hypothesis, and leveraging the Pearson chi-square test statistic, this paper puts forward a Term Distribution based Local Term Weight Algorithm and a Global Term Weight Algorithm. The experimental results at the end of the paper confirm the reliability and efficiency of the algorithms. © 2011 Academy Publisher.
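To make the idea of measuring term distribution with the Pearson chi-square statistic concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a document is cut into roughly equal consecutive segments and compares the observed count of a term in each segment with the count expected if the term were spread evenly. The function name `chi_square_dispersion`, the `num_segments` parameter, and the segmentation scheme are all hypothetical choices made for illustration. How the paper turns such a statistic into local and global weights is not stated in the abstract and is not reproduced here.

```python
def chi_square_dispersion(tokens, term, num_segments=4):
    """Pearson chi-square statistic of a term's counts across document segments.

    Illustrative assumption: the document is split into `num_segments`
    consecutive, near-equal segments. The statistic is 0.0 when the term is
    spread exactly in proportion to segment size, and grows as occurrences
    cluster in a few segments.
    """
    n = len(tokens)
    if n == 0 or num_segments < 1:
        raise ValueError("need a non-empty document and at least one segment")
    num_segments = min(num_segments, n)  # guarantees every segment is non-empty

    observed = [0] * num_segments  # occurrences of `term` per segment
    sizes = [0] * num_segments     # number of tokens per segment
    for i, tok in enumerate(tokens):
        k = i * num_segments // n  # segment index of position i
        sizes[k] += 1
        if tok == term:
            observed[k] += 1

    total = sum(observed)
    if total == 0:
        return 0.0  # term absent: nothing to measure

    # Expected count per segment under an even spread, scaled by segment size.
    expected = [total * s / n for s in sizes]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


even = ["a", "b", "a", "b", "a", "b", "a", "b"]
clustered = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(chi_square_dispersion(even, "a"))       # evenly spread term
print(chi_square_dispersion(clustered, "a"))  # term concentrated at the start
```

In this sketch the evenly spread term scores lower than the clustered one, so the statistic separates terms with the same frequency but different distributions, which plain term frequency cannot do.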