A feature selection algorithm for document clustering based on word co-occurrence frequency

Yuan-Chao Liu, Xiao-Long Wang, Bing-Quan Liu
DOI: https://doi.org/10.1109/ICMLC.2004.1378540
2004-01-01
Abstract: Constructing the feature space from only the more informative words can greatly speed up document clustering algorithms without degrading cluster quality. In this paper, we first discuss the impact of feature selection on document clustering and then propose a new feature selection method based on word co-occurrence frequency. According to the cluster hypothesis, documents from the same class are more similar to each other when represented in the vector space model (VSM), so many of the words in these documents tend to occur together. We identify these words through word co-occurrence and then construct a reduced feature space for clustering. Experiments show that the selected features are more salient. When documents are clustered in the reduced feature space, run time is shortened greatly while cluster quality is almost unchanged, which makes the clustering algorithm more suitable for practical use.
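The idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual scoring rule, so everything here is an assumption made for illustration: co-occurrence is counted at the document level, a word is kept if it co-occurs with some other word in at least `min_cooccur` documents, and the function names (`cooccurrence_counts`, `select_features`, `to_vector`) and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each unordered word pair, how many documents contain both words."""
    pair_counts = Counter()
    for doc in docs:
        vocab = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(vocab, 2))
    return pair_counts


def select_features(docs, min_cooccur=2, top_k=5):
    """Keep words that co-occur with another word in at least min_cooccur documents.

    Scoring rule (an assumption, not taken from the paper): a word's score is
    the sum of its pair counts that reach min_cooccur.
    """
    scores = Counter()
    for (w1, w2), n in cooccurrence_counts(docs).items():
        if n >= min_cooccur:
            scores[w1] += n
            scores[w2] += n
    # Sort by descending score, then alphabetically for deterministic output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def to_vector(doc, features):
    """Represent a document as term counts over the reduced feature space."""
    counts = Counter(doc.lower().split())
    return [counts[f] for f in features]


# Toy corpus (hypothetical): two finance documents, two sports documents.
docs = [
    "stock market prices fall",
    "stock market rally today",
    "football match result today",
    "football match ticket prices",
]
features = select_features(docs, min_cooccur=2, top_k=4)
vectors = [to_vector(d, features) for d in docs]
```

In this toy corpus, only the pairs (market, stock) and (football, match) appear together in two documents, so the reduced space keeps those four words and drops words such as "today" and "prices" that are spread across both topics. The resulting vectors could then be passed to any clustering algorithm in place of the full vocabulary.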