New Word Identification in Social Network Text Based on Time Series Information

Meng Wang,Lanfen Lin,Feng Wang
DOI: https://doi.org/10.1109/cscwd.2014.6846904
2014-01-01
Abstract:Different from the languages widely used in western countries such as English or French, there are no spaces between words in Chinese language, and a segmentation of the texts is necessary before other superior processes. New word identification is an important problem in the segmentation process, especially when the segmentation targets are social network texts which have more abbreviated words or other non-standard representations. Several methods have been proposed to detect Chinese new words. Most of these methods take the corpus as a static set and they don't consider the time domain information. Different from these studies, we regard our social network corpus as a text series spreading along the time line and design a new kind of features named dynamic features which can reflect the temporal variety of the string's statistical features. The experimental results on the dataset crawled from the biggest microblogging application in China show that this method can significantly improve the effect of Chinese new word identification.
What problem does this paper attempt to address?