Improving Online Clustering of Chinese Technology Web News with Bag-of-Near-Synonyms
Zhe Zhang, Le Chen, Fengjing Yin, Xin Zhang, Lixiang Guo
DOI: https://doi.org/10.1109/access.2020.2995516
IF: 3.9
2020-01-01
IEEE Access
Abstract: In the Internet era, online clustering of technology web news can help discover scientific breakthroughs and track technology trends. To do this automatically, the news documents to be clustered must be represented appropriately as numerical vectors. However, traditional representations such as Term Frequency-Inverse Document Frequency (TF-IDF) cannot distinguish near-synonyms and may cause a “dimension disaster” (the curse of dimensionality). To overcome these problems, this article proposes the Bag-of-Near-Synonyms (BoNS) model, which constructs near-synonym sets using word embeddings and agglomerative clustering and then represents a document with a Set Frequency-Inverse Document Frequency (SF-IDF) vector in which each dimension corresponds to a near-synonym set rather than a single word. To speed up computation, we further propose a hashed version of SF-IDF, named hSF-IDF, which employs a hash function to map each near-synonym set to a unique number used as its key and hence reduces the computation of SF to linear time. In addition, we apply hSF-IDF to online clustering of Chinese technology web news and propose an improved batch-based method. Extensive experiments have been conducted on a real-world dataset. The results show that our model outperforms strong baselines, including TF-IDF, average pooling of word or character embeddings, Latent Dirichlet Allocation (LDA), and bag-of-concepts, in terms of both accuracy and efficiency.
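
The hashed SF-IDF idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the near-synonym sets have already been built (the abstract obtains them from word embeddings and agglomerative clustering, which is omitted here), that documents are already tokenized, and that the weighting follows the usual TF-IDF form with sets in place of words. The function names, the toy English tokens, and the exact IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_set_index(near_synonym_sets):
    """Map every word to the integer id of its near-synonym set (hypothetical helper)."""
    index = {}
    for set_id, words in enumerate(near_synonym_sets):
        for word in words:
            index[word] = set_id
    return index


def set_frequencies(tokens, index):
    """Count set occurrences in a single pass over the tokens (one dict lookup each)."""
    counts = Counter()
    for tok in tokens:
        set_id = index.get(tok)
        if set_id is not None:  # words outside every set are ignored
            counts[set_id] += 1
    return counts


def sf_idf_vectors(docs, near_synonym_sets):
    """Return one dense SF-IDF vector per document, one dimension per set."""
    index = build_set_index(near_synonym_sets)
    sfs = [set_frequencies(doc, index) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each set appears.
    df = Counter()
    for sf in sfs:
        df.update(sf.keys())
    vectors = []
    for doc, sf in zip(docs, sfs):
        vec = [0.0] * len(near_synonym_sets)
        total = len(doc) or 1  # avoid division by zero for empty documents
        for set_id, count in sf.items():
            # Assumed weighting: relative set frequency times log(N / df).
            vec[set_id] = (count / total) * math.log(n_docs / df[set_id])
        vectors.append(vec)
    return vectors


# Toy data, invented for illustration only.
sets = [{"phone", "smartphone", "handset"}, {"chip", "processor"}, {"launch", "release"}]
docs = [
    ["phone", "chip", "launch"],
    ["smartphone", "handset", "release"],
    ["processor", "chip"],
]
vectors = sf_idf_vectors(docs, sets)
```

Because each token is resolved to its set with one dictionary lookup, counting set frequencies takes time linear in the document length, and "smartphone" and "handset" fall into the same dimension instead of two separate ones.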