Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering

Guoyu Tang, Yunqing Xia, Erik Cambria, Peng Jin, Thomas Fang Zheng
DOI: https://doi.org/10.1142/s021800141559003x
IF: 1.261
2015-01-01
International Journal of Pattern Recognition and Artificial Intelligence
Abstract: Cross-lingual document clustering is the task of automatically organizing a large collection of multilingual documents into a few clusters according to their content or topic. The language barrier and translation ambiguity are two well-known challenges for cross-lingual document representation. To address them, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. In particular, the former consists in a sense-based vector space model and the latter leverages a sense-based latent Dirichlet allocation. Evaluation on benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering.