Cross-Lingual Document Clustering Based on Similarity Space Model

TANG Guoyu, XIA Yunqing, ZHANG Min
DOI: https://doi.org/10.3969/j.issn.1003-0077.2012.02.021
2012-01-01
Abstract: Cross-lingual document clustering is the task of automatically organizing a large collection of documents written in different languages into groups according to their content or topics. This work extends the traditional monolingual Generalized Vector Space Model (GVSM) to a Cross-Lingual GVSM (CLGVSM) by using cross-lingual term similarity measures to represent documents in different languages, and it compares several term similarity measures in cross-lingual document clustering. It also proposes a new feature selection method for CLGVSM. Experimental results show that GVSM with the Second Order Co-occurrence Pointwise Mutual Information (SOCPMI) term similarity measure outperforms the latent semantic analysis (LSA) method.
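To make the GVSM idea concrete, the following is a minimal sketch of how a term similarity matrix lets documents with no shared terms still be compared. It is not the authors' implementation: the four-term bilingual vocabulary, the similarity values in S, the document weights, and the function names are all hypothetical, and in the paper the similarities would come from a measure such as SOCPMI rather than being set by hand.

```python
import math

def bilinear(u, S, v):
    """Return u^T S v for dense term-weight lists u, v and a square matrix S."""
    return sum(u[i] * S[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def gvsm_similarity(d1, d2, S):
    """Cosine-style similarity of two documents under term similarity matrix S."""
    denom = math.sqrt(bilinear(d1, S, d1) * bilinear(d2, S, d2))
    return bilinear(d1, S, d2) / denom if denom > 0 else 0.0

# Hypothetical bilingual vocabulary: ["bank", "money", "银行", "钱"].
# S[i][j] is an assumed similarity between term i and term j (hand-set here).
S = [
    [1.0, 0.3, 0.9, 0.2],
    [0.3, 1.0, 0.2, 0.9],
    [0.9, 0.2, 1.0, 0.3],
    [0.2, 0.9, 0.3, 1.0],
]
# With the identity matrix, terms are treated as mutually unrelated (plain VSM).
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

doc_en = [2.0, 1.0, 0.0, 0.0]  # English document: uses only English terms
doc_zh = [0.0, 0.0, 2.0, 1.0]  # Chinese document: uses only Chinese terms

vsm_score = gvsm_similarity(doc_en, doc_zh, IDENTITY)
clgvsm_score = gvsm_similarity(doc_en, doc_zh, S)
print(vsm_score, clgvsm_score)
```

Under the identity matrix the two documents share no terms and score zero, while the cross-lingual similarity matrix gives them a high score. A clustering algorithm would then operate on these pairwise scores.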