A Novel Clustering Algorithm for Large-Scale Text Collection and Its Incremental Version.
Lei Chen, Ming Liu, Chong Wu, Ai Xu
DOI: https://doi.org/10.5755/j01.itc.45.2.8666
IF: 0.813
2016-01-01
Information Technology And Control
Abstract: The rapid advance of internet technology has brought two challenges. One is the explosion of information. The other is that new information appears almost every day. Clustering can help users analyze information automatically, but traditional clustering algorithms are suitable only for small-scale, stable text collections. To cluster large-scale, unstable text collections, this paper proposes a novel clustering algorithm based on vector compression. We call this algorithm VCLC, short for "a clustering algorithm based on vector compression for large-scale text collection". Experimental results demonstrate that VCLC is effective for clustering large-scale text collections. The reason is that VCLC selects related features to compress the feature sets, and it adopts the iterative training idea of the self-organizing map (SOM) to fine-tune the feature weights and thereby improve clustering performance. In addition, this paper provides an incremental version of VCLC, namely I-VCLC. When new texts appear, I-VCLC chooses some samples from the original texts and uses them to adjust the neuron model so that clustering can proceed incrementally. To prevent overtraining, I-VCLC adjusts the weights of these samples as training proceeds. Experimental results demonstrate that I-VCLC clusters unstable text collections well.
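The two ideas named in the abstract, compressing the feature set and then fine-tuning weights with SOM-style iterative training, can be illustrated with a minimal sketch. This is not the VCLC algorithm itself, whose details the abstract does not give. The function names (`compress`, `nearest`, `train`), the selection rule (keep the k features with the largest total weight), the winner-only update without a neighborhood function, and the linearly decaying learning rate are all assumptions made for illustration.

```python
import math
import random

def compress(vectors, k):
    """Keep the k features with the largest total weight.

    Assumption: this selection rule is a stand-in; the abstract does not
    say how VCLC chooses its "related features".
    """
    dims = len(vectors[0])
    totals = [sum(v[j] for v in vectors) for j in range(dims)]
    keep = sorted(sorted(range(dims), key=lambda j: -totals[j])[:k])
    return [[v[j] for j in keep] for v in vectors], keep

def nearest(neurons, x):
    """Return the index of the neuron closest to x (Euclidean distance)."""
    return min(range(len(neurons)), key=lambda i: math.dist(neurons[i], x))

def train(vectors, n_neurons, epochs=20, lr0=0.5, seed=0):
    """SOM-style iterative training, simplified to a winner-only update.

    Each text vector pulls its nearest neuron toward itself; the learning
    rate decays linearly so later epochs only fine-tune the weights.
    """
    rng = random.Random(seed)
    # Initialize neurons from randomly chosen input vectors (hypothetical choice).
    neurons = [list(v) for v in rng.sample(vectors, n_neurons)]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)
        for x in vectors:
            w = neurons[nearest(neurons, x)]
            for j in range(len(w)):
                w[j] += lr * (x[j] - w[j])
    return neurons
```

After training, a text is assigned to the cluster of its nearest neuron, for example `labels = [nearest(neurons, x) for x in compressed]`. An incremental variant in the spirit of I-VCLC would continue training from the existing neurons on the new texts plus a sample of the old ones, but the abstract does not give the sampling and sample-weighting scheme, so it is not sketched here.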