Extended Strategies For Document Clustering With Word Co-Occurrences

Yang Wei, Jinmao Wei, Zhenglu Yang
DOI: https://doi.org/10.1007/978-3-319-25255-1_38
2015-01-01
Abstract: To tackle the sparse data problem of the bag-of-words model for document clustering, recent strategies have been proposed to enrich a document with the relatedness of all the words in a corpus to the document, where the relatedness is estimated by the weighted sum of word co-occurrences. However, the relatedness is overestimated unless the overlaps between word co-occurrences are eliminated. This paper demonstrates that the weighted sum strategy gives the upper bound of the theoretic degree of relatedness. Two strategies are further proposed to approach the theoretic degree of relatedness. The first strategy is established under the extreme assumption that all the words in a document co-occur with each other. By considering the specificities of words, the second strategy gives several extended versions of the weighted sum strategy. Substantial experiments verify that document clustering incorporating the extended strategies achieves a significant performance improvement over state-of-the-art techniques.
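The overestimation described in the abstract can be illustrated with a toy sketch. This is not the paper's formulation: it assumes uniform weights, measures co-occurrence as the number of corpus documents containing both words, and uses hypothetical function names (`cooccurrence_docs`, `summed_relatedness`, `union_relatedness`) and a made-up four-document corpus. It only shows that summing pairwise co-occurrences counts overlapping documents more than once, so the sum can never fall below the overlap-free count.

```python
from typing import Iterable, List, Set


def cooccurrence_docs(corpus: List[Set[str]], a: str, b: str) -> Set[int]:
    """Indices of corpus documents in which words a and b both appear."""
    return {i for i, doc in enumerate(corpus) if a in doc and b in doc}


def summed_relatedness(corpus: List[Set[str]], target: str,
                       doc_words: Iterable[str]) -> int:
    """Sum of pairwise co-occurrence counts (assumption: uniform weights).

    A corpus document containing the target together with several words
    of the document is counted once per word, i.e. overlaps are kept.
    """
    return sum(len(cooccurrence_docs(corpus, target, w)) for w in doc_words)


def union_relatedness(corpus: List[Set[str]], target: str,
                      doc_words: Iterable[str]) -> int:
    """Count of distinct corpus documents in which the target co-occurs
    with at least one word of the document (overlaps removed)."""
    hits: Set[int] = set()
    for w in doc_words:
        hits |= cooccurrence_docs(corpus, target, w)
    return len(hits)


# Hypothetical toy corpus; each document is a set of words.
corpus = [
    {"cluster", "document", "word"},
    {"cluster", "document"},
    {"cluster", "word"},
    {"sparse", "vector"},
]
doc_words = ["document", "word"]  # words of the document being enriched

summed = summed_relatedness(corpus, "cluster", doc_words)
union = union_relatedness(corpus, "cluster", doc_words)
print(summed, union)
```

In this toy corpus the first document contains "cluster" together with both "document" and "word", so the summed estimate counts it twice while the overlap-free count includes it once.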