Chinese Documents Classification Based on N-Grams

Shuigeng Zhou, Jihong Guan
DOI: https://doi.org/10.1007/3-540-45715-1_43
2002-01-01
Abstract: Traditional Chinese document classifiers are based on keywords in the documents, which requires dictionary support and efficient segmentation procedures. This paper explores techniques that use N-gram information to categorize Chinese documents, so that the classifier is freed from the burden of large dictionaries and complex segmentation processing and consequently becomes domain and time independent. A Chinese document classification system following these techniques is implemented with Naive Bayes, kNN, and hierarchical classification methods. Experimental results show that the system achieves satisfactory performance, comparable to that of traditional classifiers.
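The core idea of the abstract, classifying Chinese text from character N-grams rather than dictionary-segmented words, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the helper `char_ngrams`, the class `NGramNaiveBayes`, the choice of bigrams, Laplace smoothing with `alpha=1.0`, and the toy training sentences. The abstract does not say how the authors select or weight N-grams, so this shows only a generic multinomial Naive Bayes over overlapping character bigrams.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, ignoring whitespace.

    No dictionary or word segmentation is needed: the features are
    simply every run of n consecutive characters.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


class NGramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n                              # n-gram length (assumed: bigrams)
        self.alpha = alpha                      # Laplace smoothing constant (assumed)
        self.class_docs = Counter()             # documents seen per class
        self.gram_counts = defaultdict(Counter) # n-gram counts per class
        self.vocab = set()                      # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_docs[label] += 1
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this sketch only.
train_texts = ["足球比赛今晚开始", "篮球比赛非常精彩",
               "股票市场今天上涨", "银行利率再次下调"]
train_labels = ["sports", "sports", "finance", "finance"]

clf = NGramNaiveBayes(n=2).fit(train_texts, train_labels)
print(clf.predict("今晚的足球比赛"))
print(clf.predict("股票市场下调"))
```

Because the features are raw character sequences, new vocabulary in a domain shows up as new N-grams at training time without any dictionary update, which is the kind of domain and time independence the abstract refers to. A kNN or hierarchical classifier could consume the same N-gram counts; those variants are not sketched here.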