Research on new word identification

Dun Li, Fuyuan Cao, Yuanda Cao, Yueliang Wan
2007-01-01
Journal of Computational Information Systems
Abstract: The emergence of new words is evidence that society and language change together, but automatic word segmentation, the basis of Chinese information processing, cannot identify new words accurately and instead segments them into scattered single-character strings. In this paper, the characteristics of new words are analyzed; the texts are pre-processed using the laws of string frequency and a time function to obtain a candidate string set; and an improved association-rule algorithm is proposed to identify new words from the candidate string set according to three characteristics of new words: adjacency, sequence, and frequency. The experiments show that the method can distinguish new words from common single-character strings, reduce the rigidity of traditional N-gram model matching, solve the short-in-long word problem, and increase the precision of new word identification.
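The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract gives no formulas, so the function names (`count_runs`, `identify_new_words`), the thresholds (`min_freq`, `min_conf`, `max_len`), the confidence measure, and the nested-string filter are all assumptions made for illustration, and the time function is omitted entirely. The sketch assumes a segmenter has already produced token lists, collects strings from runs of adjacent single-character tokens, keeps frequent ones whose ordered "prefix -> last character" rule has high confidence, and drops short strings nested in longer ones of equal frequency.

```python
from collections import Counter


def count_runs(segmented_docs, max_len=4):
    """Count n-grams (1 <= n <= max_len) inside runs of adjacent
    single-character tokens left over by a word segmenter."""
    counts = Counter()
    for tokens in segmented_docs:
        run = []
        for tok in list(tokens) + [""]:  # empty sentinel flushes the final run
            if len(tok) == 1:
                run.append(tok)
                continue
            for n in range(1, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts["".join(run[i:i + n])] += 1
            run = []
    return counts


def identify_new_words(segmented_docs, min_freq=2, min_conf=0.6, max_len=4):
    """Hypothetical filter: frequency, ordered-rule confidence, then
    removal of short strings nested in longer ones (short-in-long)."""
    counts = count_runs(segmented_docs, max_len)
    kept = {}
    for s, freq in counts.items():
        if len(s) < 2 or freq < min_freq:
            continue
        # Assumed confidence of the ordered rule "prefix -> last character".
        conf = freq / counts[s[:-1]]
        if conf >= min_conf:
            kept[s] = freq
    # Assumed short-in-long handling: drop a string that only ever occurs
    # inside a longer kept string (same frequency).
    return sorted(
        s for s, f in kept.items()
        if not any(s != t and s in t and f == kept[t] for t in kept)
    )


docs = [
    ["今天", "刷", "微", "博", "很", "开心"],
    ["他", "的", "微", "博", "更新", "了"],
    ["微", "博", "用户", "增长"],
]
print(identify_new_words(docs))  # -> ['微博']
```

Working on runs of any length up to `max_len` rather than on a fixed n is one simple way to avoid the rigidity of fixed-length N-gram matching that the abstract mentions; the real method may handle this differently.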