Short text clustering based on word embeddings and EMD

Dong HUANG, Bo XU, Kan XU, Hong-fei LIN, Zhi-hao YANG
DOI: https://doi.org/10.6040/j.issn.1671-9352.1.2016.123
2017-01-01
Abstract: Short text clustering plays an important role in data mining. Traditional short text clustering models suffer from high dimensionality, sparse data, and a lack of semantic information. To overcome the difficulties caused by sparse features, semantic ambiguity, dynamics, and other factors, this paper presents a text representation based on word embeddings and a short text clustering algorithm based on the moving distance between feature words. First, word embeddings that capture the semantics of the feature words are obtained by training the Continuous Skip-gram Model on a large-scale corpus. Next, the Euclidean distance is used to measure the similarity between feature words. Then, EMD (Earth Mover's Distance) is used to compute the similarity between short texts. Finally, the similarity between short texts is applied to the K-means clustering algorithm to cluster the short texts. Evaluation results on three data sets show that this method outperforms traditional clustering algorithms.
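The EMD step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each short text is a non-empty list of words, that every word has a vector, that word weights are normalized term frequencies, and that the transport problem is solved with a small successive-shortest-path min-cost flow. The names `emd`, `term_weights`, and `TOY_VECTORS` are hypothetical, and the 2-D vectors are invented stand-ins for trained Skip-gram embeddings.

```python
import math
from collections import Counter

EPS = 1e-12

# Hypothetical 2-D "embeddings"; real ones would come from a trained Skip-gram model.
TOY_VECTORS = {"cat": (0.0, 0.0), "dog": (1.0, 0.0),
               "car": (5.0, 5.0), "truck": (5.0, 6.0)}

def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def term_weights(words):
    """Normalized term frequencies (assumed weighting; weights sum to 1)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def emd(words_a, words_b, vectors):
    """Earth Mover's Distance between two short texts via min-cost flow."""
    wa, wb = term_weights(words_a), term_weights(words_b)
    src, dst = list(wa), list(wb)
    m, n = len(src), len(dst)
    s, t = 0, m + n + 1                      # super-source and super-sink
    graph = [[] for _ in range(m + n + 2)]   # edge = [to, capacity, cost, reverse index]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0.0, -cost, len(graph[u]) - 1])

    for i, w in enumerate(src):
        add_edge(s, 1 + i, wa[w], 0.0)
    for j, w in enumerate(dst):
        add_edge(1 + m + j, t, wb[w], 0.0)
    for i, a in enumerate(src):
        for j, b in enumerate(dst):
            add_edge(1 + i, 1 + m + j, 1.0, euclidean(vectors[a], vectors[b]))

    total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        dist = [math.inf] * len(graph)
        prev = [None] * len(graph)
        dist[s] = 0.0
        for _ in range(len(graph) - 1):
            updated = False
            for u in range(len(graph)):
                if dist[u] == math.inf:
                    continue
                for idx, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > EPS and dist[u] + cost < dist[v] - EPS:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, idx)
                        updated = True
            if not updated:
                break
        if dist[t] == math.inf:
            break                            # all mass has been moved
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = math.inf, t
        while v != s:
            u, idx = prev[v]
            push = min(push, graph[u][idx][1])
            v = u
        v = t
        while v != s:
            u, idx = prev[v]
            graph[u][idx][1] -= push
            graph[v][graph[u][idx][3]][1] += push
            v = u
        total_cost += push * dist[t]
    return total_cost

if __name__ == "__main__":
    print(emd(["cat", "dog"], ["car", "truck"], TOY_VECTORS))
```

A pairwise distance matrix built this way could then be handed to a clustering step. How the paper combines these distances with K-means is not stated in the abstract, so that part is left out of the sketch.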