Categorizing Web Information on Subject with Statistical Language Modeling

Xindong Zhou, Ting Wang, Huiping Zhou, Huowang Chen
DOI: https://doi.org/10.1007/978-3-540-30480-7_41
2004-01-01
Abstract: With the rapid growth of the information available on the Internet, it is becoming more difficult to find relevant information quickly on the Web. Text classification, one of the most useful web information processing tools, has received increasing attention recently. Instead of using traditional classification models, we apply n-gram language models to classify Chinese Web text by subject. We investigate several factors that have an important effect on the performance of n-gram models, including the order n, the smoothing technique, and the granularity of the textual representation unit in Chinese. The experimental results indicate that the word-based bi-gram model and the character-based tri-gram model outperform the others, achieving an F1 score of approximately 90%.
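To make the approach concrete, the following is a minimal sketch of classification with per-class character n-gram language models: one model is trained per subject, and a document is assigned to the subject whose model gives it the highest log-probability. It is an illustration under stated assumptions, not the authors' implementation. The names `char_ngrams` and `NgramClassifier`, the add-alpha (Laplace) smoothing, the uniform class prior, and the toy training texts are all assumptions; the abstract does not say which smoothing techniques were compared.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of text, one per character (hypothetical helper)."""
    # Pad the start so that the first characters also have a full-length history.
    padded = "^" * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class NgramClassifier:
    """One add-alpha smoothed character n-gram model per class (illustrative only)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                # order of the n-gram model
        self.alpha = alpha        # smoothing constant (assumption: add-alpha)
        self.ngram_counts = {}    # label -> Counter of n-grams
        self.context_counts = {}  # label -> Counter of (n-1)-character histories
        self.vocab = set()        # characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            ngrams = self.ngram_counts.setdefault(label, Counter())
            contexts = self.context_counts.setdefault(label, Counter())
            for gram in char_ngrams(text, self.n):
                ngrams[gram] += 1
                contexts[gram[:-1]] += 1
            self.vocab.update(text)
        return self

    def log_prob(self, text, label):
        # Reserve one extra vocabulary slot for characters never seen in training.
        vocab_size = len(self.vocab) + 1
        ngrams = self.ngram_counts[label]
        contexts = self.context_counts[label]
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = ngrams[gram] + self.alpha
            denominator = contexts[gram[:-1]] + self.alpha * vocab_size
            total += math.log(numerator / denominator)
        return total

    def predict(self, text):
        # Uniform class prior (assumption): pick the class with the highest likelihood.
        return max(self.ngram_counts, key=lambda label: self.log_prob(text, label))


# Toy data, invented for illustration only.
train_texts = ["股票市场上涨", "股市行情分析", "足球比赛结果", "篮球比赛直播"]
train_labels = ["finance", "finance", "sports", "sports"]

classifier = NgramClassifier(n=3).fit(train_texts, train_labels)
print(classifier.predict("足球直播"))
```

A word-based variant would replace `char_ngrams` with n-grams over the output of a Chinese word segmenter, which this sketch leaves out; changing `n` and `alpha` corresponds loosely to the order and smoothing factors the abstract mentions.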