A Weibo-Oriented Method for Unknown Word Extraction

Shuai Zhang, Qianren Liu, Lei Wang
DOI: https://doi.org/10.1109/skg.2012.15
2012-01-01
Abstract: Unknown word recognition is one of the most prominent and challenging problems in Chinese language processing. Several effective approaches have been proposed, but they do not work well on Chinese Twitter (i.e., weibo) messages. This paper presents a method for recognizing unknown words in weibo messages. Because weibo messages exhibit great flexibility in wording and a high correlation between unknown words and unpredictable topics, the proposed method first groups the corpus into multiple categories using K-means and then derives a morpheme set from each category based on local term frequencies. Second, for each potential unknown word in each morpheme set, a newly introduced measure, named adjacency degree, is calculated to determine whether the candidate is a correct unknown word. Experiments show that the proposed method is efficient, precise, and insensitive to the size of the weibo corpus.
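The second stage described in the abstract (local term frequencies followed by an adjacency check) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define adjacency degree, so the sketch assumes a simple stand-in: the smaller of the number of distinct characters seen immediately to the left and to the right of a candidate. The function names, the thresholds `min_freq` and `min_adjacency`, and the use of character n-grams as candidates are all hypothetical, and the K-means grouping is assumed to have already produced the list of messages for one category.

```python
from collections import Counter


def candidate_counts(messages, max_len=4):
    """Count character n-grams (length 2..max_len) inside one category."""
    counts = Counter()
    for msg in messages:
        for n in range(2, max_len + 1):
            for i in range(len(msg) - n + 1):
                counts[msg[i:i + n]] += 1
    return counts


def adjacency_degree(messages, candidate):
    """ASSUMED definition, not taken from the paper: the smaller of the
    number of distinct left neighbors and distinct right neighbors.
    Message boundaries count as the neighbors "^" and "$"."""
    left, right = set(), set()
    for msg in messages:
        start = msg.find(candidate)
        while start != -1:
            end = start + len(candidate)
            left.add(msg[start - 1] if start > 0 else "^")
            right.add(msg[end] if end < len(msg) else "$")
            start = msg.find(candidate, start + 1)
    return min(len(left), len(right))


def extract_unknown_words(messages, min_freq=3, min_adjacency=2, max_len=4):
    """Keep candidates that are locally frequent and appear in varied contexts.
    Both thresholds are illustrative values, not values from the paper."""
    counts = candidate_counts(messages, max_len)
    return sorted(
        cand
        for cand, freq in counts.items()
        if freq >= min_freq and adjacency_degree(messages, cand) >= min_adjacency
    )


# One hypothetical category of messages, as if produced by the K-means step.
category = ["我爱给力的天气", "真给力啊", "太给力了", "他很给力"]
print(extract_unknown_words(category))  # ['给力']
```

The intuition behind this stand-in measure is that a genuine word tends to occur in many different contexts, whereas a fragment of a longer word is almost always preceded or followed by the same character.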