Applications of Statistical Models in Chinese Text Mining

Jian WANG, Jun-ni ZHANG
DOI: https://doi.org/10.13860/j.cnki.sltj.20170123-012
2017-01-01
Abstract: This paper discusses three problems in Chinese text mining: word segmentation, keyword extraction, and text classification. For word segmentation, we introduce the ICTCLAS method, which is based on a hierarchical hidden Markov model, and the WDM method, which treats the boundaries between words as missing data and uses the EM algorithm to find the solution. For keyword extraction, we propose a method based on the Bayes factor and introduce the CCS method, which uses sparse regression. For text classification, we introduce one method that builds classifiers on keyword frequencies and another that first trains topic models and then builds classifiers on topic proportions. The paper then compares these methods on two text datasets and offers suggestions for their practical use.
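
As a rough illustration of the keyword-frequency approach to text classification mentioned in the abstract, the sketch below counts a fixed keyword list in documents that are already segmented and fits a classifier on those counts. The abstract does not say which classifier the paper uses, so multinomial naive Bayes is an assumption made here for illustration. The keyword list, the toy documents, the labels, and the function names (keyword_counts, train_nb, predict_nb) are likewise hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def keyword_counts(tokens, keywords):
    """Return the frequency of each keyword in one segmented document."""
    counts = Counter(t for t in tokens if t in keywords)
    return [counts[k] for k in keywords]


def train_nb(docs, labels, keywords, alpha=1.0):
    """Fit multinomial naive Bayes on keyword-frequency vectors.

    Assumption: naive Bayes stands in for whatever classifier the paper uses.
    alpha is the additive (Laplace) smoothing constant.
    """
    model = {}
    for c in sorted(set(labels)):
        totals = [0] * len(keywords)
        n_docs = 0
        for tokens, y in zip(docs, labels):
            if y == c:
                n_docs += 1
                for j, v in enumerate(keyword_counts(tokens, keywords)):
                    totals[j] += v
        denom = sum(totals) + alpha * len(keywords)
        log_prior = math.log(n_docs / len(docs))
        log_lik = [math.log((t + alpha) / denom) for t in totals]
        model[c] = (log_prior, log_lik)
    return model


def predict_nb(model, tokens, keywords):
    """Return the class with the highest posterior log-score."""
    x = keyword_counts(tokens, keywords)

    def score(c):
        log_prior, log_lik = model[c]
        return log_prior + sum(v * w for v, w in zip(x, log_lik))

    return max(model, key=score)


# Hypothetical toy data: each document is a list of already-segmented words.
keywords = ["股票", "利率", "比赛", "球队"]
docs = [
    ["股票", "上涨", "利率"],
    ["利率", "下调", "股票", "股票"],
    ["球队", "赢得", "比赛"],
    ["比赛", "结束", "球队", "球队"],
]
labels = ["finance", "finance", "sports", "sports"]

model = train_nb(docs, labels, keywords)
print(predict_nb(model, ["今天", "股票", "大涨"], keywords))
```

In this sketch the segmentation and keyword-extraction steps are taken as given: the documents arrive as word lists and the keywords are fixed by hand, whereas in the pipeline the abstract describes they would come from the segmentation and keyword-extraction methods.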