Extracting Unknown Words From Sina Weibo Via Data Clustering
Kai Lei,Weiyang Zhang,Kai Zhang,Kuai Xu
DOI: https://doi.org/10.1109/ICC.2015.7248483
2015-01-01
Abstract:Sina Weibo, a Twitter-like microblogging site attracting over 240 million monthly active users to tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. As many users create new and innovative words on their tweets and comments, it is necessary to extract these emerging words, which do not exist in today's Chinese vocabulary or dictionary. Towards this end, this paper proposes a novel method based on data clustering of Weibo users and tweets for extracting unknown words from Weibo tweets and comments. Specifically, relying on the similarity of the users who post the tweets, we apply a hierarchical clustering to divide Weibo data into distinct groups, e.g., sports, news stories, movies, before extraction. Comparing with the method of unclustered Weibo data, our experimental results have successfully demonstrated the benefits of the proposed data clustering scheme for improving the recall and accuracy of extracting unknown Chinese words from tweets and comments.