Study on influences of different Chinese word segmentation methods to text automatic classification based on LDA model

Xiangdong Li, Fan Gao, Cong Ding
DOI: https://doi.org/10.3969/j.issn.1001-3695.2017.01.012
2017-01-01
Abstract: From the perspective of corpus measures, comprising three indicators (category clustering density, category complexity, and category definition), this paper studies the influence of three representative Chinese word segmentation methods, ICTCLAS, IK Analyzer, and 2-gram, on text classification performance under the latent probabilistic topic model LDA. Moreover, the applicability of the different Chinese word segmentation methods to different types of text, such as Web pages and academic documents, and the causes of that applicability are analyzed qualitatively and quantitatively. Experiments show that the three indicators effectively reflect the influence of the word segmentation method on text classification: IK Analyzer is influenced more by category complexity and ICTCLAS more by category clustering density, whereas for 2-gram the three indicators have similar influence, so it adapts well to different corpora. For academic-literature corpora, 2-gram performs better, with F1 values above 80%, while Web-page corpora are more adaptable to different word segmentation methods. The paper provides a reference for selecting an appropriate Chinese word segmentation method in LDA-based classification systems for different types of text, such as Web pages and academic literature, by means of corpus measures rather than by experiments alone.
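Of the three segmentation methods named in the abstract, 2-gram is the only one that needs no dictionary, which may help explain its adaptability across corpora. The following is a minimal sketch of character-level 2-gram segmentation, assuming overlapping bigrams and whitespace removal only; the function name `bigram_segment` is hypothetical, and the paper's actual preprocessing (for example punctuation or stop-word handling) is not described in the abstract.

```python
def bigram_segment(text):
    """Split text into overlapping character 2-grams.

    Assumption: whitespace is dropped and every other character is kept,
    including punctuation. This is an illustrative sketch, not the
    authors' implementation.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        # A single character (or empty input) cannot form a 2-gram.
        return ["".join(chars)] if chars else []
    # Slide a window of width 2 over the character sequence.
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


if __name__ == "__main__":
    print(bigram_segment("文本分类"))
```

The resulting tokens could then be counted into a bag-of-words matrix and passed to an LDA implementation, in the same way as tokens produced by a dictionary-based segmenter.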