Unsupervised model for Microblog new words detection based on repeated string

Xiao SUN, Cheng-cheng LI, Jia-qi YE, Fu-ji REN
DOI: https://doi.org/10.3969/j.issn.1003-5060.2014.06.008
2014-01-01
Abstract: The characteristics of colloquial microblog text are studied to develop appropriate language rules, and statistical and rule-based methods are combined on the basis of the statistical characteristics of repeated strings. First, the microblog corpus is segmented with the existing system dictionary. Then the new words that appear two or more times are extracted from the sub-word fragments. Through multi-layer filtering, the candidate new words are recognized. The experimental results show that the method is effective, achieving high precision and recall as well as fast extraction of new words.
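The pipeline in the abstract (dictionary segmentation, extraction of strings repeated at least twice, then layered filtering) can be illustrated with a minimal Python sketch. The abstract does not give the paper's segmenter, rules, or filters, so everything here is an assumption made for illustration: forward maximum matching stands in for the segmenter, a fragment is taken to be a run of out-of-dictionary single characters, and the two filter layers (a stop-character check and removal of substrings subsumed by a longer candidate with the same count) are placeholders. The function names `segment`, `fragments`, and `candidate_new_words` are hypothetical.

```python
from collections import Counter


def segment(text, dictionary, max_len=4):
    """Forward maximum matching (an assumed stand-in for the system segmenter).

    Characters that match no dictionary entry come out as single tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def fragments(tokens, dictionary):
    """Join runs of consecutive out-of-dictionary single characters."""
    runs, current = [], ""
    for tok in tokens:
        if len(tok) == 1 and tok not in dictionary:
            current += tok
        else:
            if len(current) > 1:
                runs.append(current)
            current = ""
    if len(current) > 1:
        runs.append(current)
    return runs


def candidate_new_words(posts, dictionary, min_count=2, max_n=4,
                        stop_chars=frozenset()):
    """Count substrings of fragments and keep those seen at least min_count times."""
    counts = Counter()
    for post in posts:
        for frag in fragments(segment(post, dictionary), dictionary):
            for n in range(2, min(max_n, len(frag)) + 1):
                for i in range(len(frag) - n + 1):
                    counts[frag[i:i + n]] += 1
    repeated = {s: c for s, c in counts.items() if c >= min_count}
    # Assumed filter layer 1: drop strings that begin or end with a stop character.
    kept = {s: c for s, c in repeated.items()
            if s[0] not in stop_chars and s[-1] not in stop_chars}
    # Assumed filter layer 2: drop a string contained in a longer candidate
    # that occurs equally often, since the longer one explains its count.
    return {s: c for s, c in kept.items()
            if not any(s != t and s in t and c == kept[t] for t in kept)}


# Toy data, invented for this example only.
dictionary = {"我", "喜欢", "今天", "很"}
posts = ["我喜欢给力", "今天很给力", "我喜欢萌萌"]
print(candidate_new_words(posts, dictionary))
```

In this toy run the string "给力" occurs in two posts and survives both filters, while "萌萌" occurs only once and is discarded by the frequency threshold. A real system would replace the placeholder filters with the linguistic rules the paper derives from microblog text.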