Method of new Chinese words identification from large scale network corpora

Haijun ZHANG, Yong LI, Qiqi YAN
DOI: https://doi.org/10.3778/j.issn.1002-8331.1403-0103
2015-01-01
Abstract: Identifying new words in large-scale corpora is a basic task in automatic Chinese language processing. The task is difficult because it requires both rapid processing of large-scale corpora and fairly intelligent methods. Building on a survey of prior work, this paper constructs a framework for identifying new Chinese words in large-scale network corpora. The framework comprises a repeat extraction algorithm based on hierarchical pruning, a new word detection method based on statistical learning, and a part-of-speech (POS) guessing method based on combined features. Experiments and analyses show that the framework can rapidly extract repeats from large-scale corpora and construct the set of candidate new words, and that it carries out new word detection and POS guessing efficiently and with good results.
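The abstract does not give the details of the repeat extraction step, so the following is only an illustrative sketch of one common way to prune level by level, not the authors' algorithm. It assumes that a repeat is any character string occurring at least `min_freq` times, and that a string of length n is counted only if both of its length n-1 substrings were frequent at the previous level. The function name `extract_repeats` and the parameters `min_freq` and `max_len` are hypothetical.

```python
from collections import Counter


def extract_repeats(text, min_freq=2, max_len=4):
    """Return {substring: count} for substrings of length 1..max_len
    that occur at least min_freq times in text.

    Illustrative level-wise pruning (an assumption, not the paper's method):
    a length-n string is counted only if its length-(n-1) prefix and suffix
    were both frequent at the previous level.
    """
    repeats = {}

    # Level 1: keep single characters that meet the frequency threshold.
    frequent = {g: c for g, c in Counter(text).items() if c >= min_freq}
    repeats.update(frequent)

    n = 2
    while frequent and n <= max_len:
        counts = Counter()
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Prune: skip candidates with an infrequent prefix or suffix.
            if gram[:-1] in frequent and gram[1:] in frequent:
                counts[gram] += 1
        # Keep only the candidates that are themselves frequent.
        frequent = {g: c for g, c in counts.items() if c >= min_freq}
        repeats.update(frequent)
        n += 1

    return repeats


if __name__ == "__main__":
    # Toy input; a real corpus would be a large body of Chinese text.
    for gram, count in sorted(extract_repeats("abcabcabd").items()):
        print(gram, count)
```

Because any string containing an infrequent substring is skipped without being counted, the candidate set shrinks at each level, which is what keeps this kind of extraction tractable on large corpora. The surviving repeats would then serve as candidate new words for the detection and POS guessing stages.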