A Text Document Clustering Method Based on Weighted BERT Model

Yutong Li, Juanjuan Cai, Jingling Wang
DOI: https://doi.org/10.1109/itnec48623.2020.9085059
2020-01-01
Abstract: Traditional text document clustering methods represent documents with uncontextualized word embeddings and the vector space model, which neglect polysemy and the semantic relations between words. This paper presents a novel text document clustering method to address these problems. First, the pre-trained language representation model Bidirectional Encoder Representations from Transformers (BERT) is used to generate sentence embeddings. Then, two sentence-level weighting schemes based on named entities are designed to improve performance. Finally, the k-means clustering algorithm is applied to find groups of similar documents. Experimental results on four datasets indicate that the proposed weighted method achieves higher accuracy than the unweighted averaging method. Friedman tests conducted separately on F1 scores and Adjusted Rand Index (ARI) values both confirm the better overall performance of the proposed method.
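The abstract does not spell out the two weighting schemes, so the following is only a minimal sketch of the general idea: pooling sentence embeddings into one document vector, with sentences that mention more named entities weighted more heavily. The function names, the additive smoothing term, and the toy 2-dimensional vectors (standing in for BERT sentence embeddings) are all assumptions made for illustration, not the authors' formulation.

```python
from typing import List, Sequence


def entity_weights(entity_counts: Sequence[int], smoothing: float = 1.0) -> List[float]:
    """Turn per-sentence named-entity counts into weights that sum to 1.

    The additive smoothing is a hypothetical choice: it keeps sentences
    without any entities from being dropped entirely.
    """
    raw = [count + smoothing for count in entity_counts]
    total = sum(raw)
    return [value / total for value in raw]


def weighted_document_embedding(
    sentence_embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Weighted average of sentence vectors; weights are assumed to sum to 1."""
    if len(sentence_embeddings) != len(weights):
        raise ValueError("need one weight per sentence")
    dim = len(sentence_embeddings[0])
    doc = [0.0] * dim
    for vector, weight in zip(sentence_embeddings, weights):
        for i in range(dim):
            doc[i] += weight * vector[i]
    return doc


# Toy 2-dimensional "sentence embeddings" standing in for BERT outputs.
sentences = [[1.0, 0.0], [0.0, 1.0]]
counts = [3, 0]  # first sentence mentions three entities, second none
weights = entity_weights(counts)
doc_vector = weighted_document_embedding(sentences, weights)
```

With equal entity counts the weights become uniform, which reduces to the unweighted average that the abstract uses as its baseline. In a full pipeline, the resulting document vectors would then be passed to a k-means implementation (for example, scikit-learn's `KMeans`) to form the clusters.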