Inducing Word Senses for Cross-lingual Document Clustering

Guoyu Tang, Yunqing Xia, Erik Cambria, Peng Jin
DOI: https://doi.org/10.1109/cis.2013.93
2013-01-01
Abstract: Cross-lingual document clustering is the task of automatically organizing a large collection of cross-lingual documents into a few groups according to their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To address these issues, we propose to represent cross-lingual documents through statistical word senses, which are learned from a parallel corpus by means of a novel cross-lingual word sense induction model. Furthermore, a sense clustering method is adopted to discover semantic relations among word senses, which are then used to represent cross-lingual documents through a sense-based vector space model. Evaluation on a benchmark dataset shows that the proposed model outperforms two state-of-the-art models in cross-lingual document clustering.
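To illustrate what a sense-based vector space model could look like, the following is a minimal sketch, not the authors' implementation. It assumes the induction and sense clustering steps have already produced a mapping from (language, word) pairs to sense-cluster IDs; the `WORD_TO_SENSE` table, the English/Spanish vocabulary, the cluster names, and the helper functions are all hypothetical. It also simplifies by assigning one sense per word, whereas an induced sense would normally depend on context.

```python
import math
from collections import Counter

# Hypothetical output of sense induction + sense clustering:
# (language, word) -> sense-cluster ID. Toy data, one sense per word.
WORD_TO_SENSE = {
    ("en", "bank"): "S_FINANCE",
    ("en", "money"): "S_FINANCE",
    ("es", "banco"): "S_FINANCE",
    ("es", "dinero"): "S_FINANCE",
    ("en", "river"): "S_RIVER",
    ("es", "río"): "S_RIVER",
}


def sense_vector(tokens, lang):
    """Represent a document as counts over sense-cluster IDs.

    Tokens with no known sense are skipped.
    """
    return Counter(
        WORD_TO_SENSE[(lang, tok)] for tok in tokens if (lang, tok) in WORD_TO_SENSE
    )


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Documents in different languages share the same sense dimensions,
# so they can be compared directly without translating words.
doc_en = sense_vector(["bank", "money"], "en")
doc_es = sense_vector(["banco"], "es")
doc_river = sense_vector(["river"], "en")

sim_same_topic = cosine(doc_en, doc_es)
sim_diff_topic = cosine(doc_en, doc_river)
```

Because both documents are projected onto language-independent sense dimensions, any standard clustering algorithm could then be run over these vectors; which algorithm the paper uses is not stated in the abstract.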