Automatic Selection of Chinese Stoplist

Yi-jun GU, Xiao-Zhong FAN, Jian-hua WANG, Tao WANG, Wei-jin HUANG
DOI: https://doi.org/10.3969/j.issn.1001-0645.2005.04.014
2005-01-01
Abstract: This paper investigates statistical methods for automatically selecting stop words and proposes a new method. The method calculates the probability that a word occurs in each sentence of the corpus and the probability that the sentences containing the word occur in the corpus, then calculates the entropy of these probabilities and selects stop words according to the entropy. The stoplist determined by this method is compared with those determined by traditional methods, and the effects of various preprocessing methods on text categorization are also compared. The experiments show that the method is better at avoiding the influence of the corpus's writing style on the choice of stoplist, and is more suitable than traditional methods for preprocessing in text categorization.
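The abstract does not give the exact formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the corpus is already segmented into sentences and tokens, takes the share of a word's occurrences that falls in each sentence as the per-sentence probability, and ranks words by the Shannon entropy of that distribution. The function names `word_entropies` and `select_stop_words`, the `top_k` cutoff, and the choice to rank by occurrence entropy alone are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def word_entropies(sentences):
    """Return {word: (occurrence_entropy, sentence_probability)}.

    sentences: list of token lists (already word-segmented).
    occurrence_entropy: entropy of the word's occurrences across sentences
        (assumed formula; the abstract does not specify one).
    sentence_probability: fraction of sentences that contain the word.
    """
    per_sentence_counts = defaultdict(list)
    for tokens in sentences:
        for word, count in Counter(tokens).items():
            per_sentence_counts[word].append(count)

    n_sentences = len(sentences)
    scores = {}
    for word, counts in per_sentence_counts.items():
        total = sum(counts)
        # p_i = share of this word's occurrences that falls in sentence i
        entropy = -sum((c / total) * math.log2(c / total) for c in counts)
        scores[word] = (entropy, len(counts) / n_sentences)
    return scores


def select_stop_words(sentences, top_k):
    """Rank words by occurrence entropy (highest first) and keep top_k."""
    scores = word_entropies(sentences)
    ranked = sorted(scores, key=lambda w: scores[w][0], reverse=True)
    return ranked[:top_k]
```

Under these assumptions, a word spread evenly over many sentences gets a high entropy and is a stop word candidate, while a word concentrated in one sentence gets an entropy of zero. For example, `select_stop_words(corpus, 50)` would return the 50 highest-entropy words of a hypothetical segmented `corpus`.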