Document Clustering Based on Word Sense Cluster

TANG Guoyu, XIA Yunqing, ZHANG Min, ZHENG Fang
DOI: https://doi.org/10.3969/j.issn.1003-0077.2013.03.015
2013-01-01
Abstract: Document representation is a key part of document clustering. In this paper, we aim to improve document representation for document clustering. Synonymy and polysemy are two challenging issues in document representation. Inspired by the observation that synonymy and polysemy are mainly related to word sense, we present a novel model, referred to as the Sense Cluster Model (SCM), which addresses both issues by representing documents with word sense clusters. In SCM, word sense clusters are first constructed from the development dataset by 1) word sense induction, which automatically discovers the different senses of each word from raw text; and 2) word sense clustering, which recognizes identical or similar words. Then, after word sense disambiguation, a probability distribution over word sense clusters is generated to represent each document. Experiments conducted on benchmark data show that SCM outperforms both the baseline and the classic topic model, LDA, in the task of document clustering.
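The final representation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sense-to-cluster table, the `toy_disambiguate` rule, and the function name `sense_cluster_distribution` are all hypothetical stand-ins, and the distribution is assumed here to be a simple normalized count of the sense clusters found in a document.

```python
from collections import Counter

# Hypothetical toy table: (word, sense_id) -> sense-cluster id.
# In SCM this would come from word sense induction plus sense clustering.
SENSE_TO_CLUSTER = {
    ("bank", 0): 0,    # financial sense
    ("bank", 1): 1,    # riverside sense
    ("lender", 0): 0,  # shares cluster 0 with the financial sense of "bank"
    ("shore", 0): 1,   # shares cluster 1 with the riverside sense of "bank"
}


def toy_disambiguate(word, context):
    """Placeholder word sense disambiguation (an assumption, not the paper's method)."""
    if word == "bank" and "river" in context:
        return 1
    return 0


def sense_cluster_distribution(tokens, num_clusters):
    """Represent a document as normalized counts over sense clusters."""
    counts = Counter()
    for word in tokens:
        sense = toy_disambiguate(word, tokens)
        cluster = SENSE_TO_CLUSTER.get((word, sense))
        if cluster is not None:  # skip words with no known sense cluster
            counts[cluster] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_clusters
    return [counts[c] / total for c in range(num_clusters)]


doc_finance = ["bank", "lender", "money"]
doc_river = ["river", "bank", "shore"]
print(sense_cluster_distribution(doc_finance, 2))
print(sense_cluster_distribution(doc_river, 2))
```

In this toy setup the polysemous word "bank" lands in different clusters depending on context, while "bank" and "lender" share a cluster, which is the intuition behind handling polysemy and synonymy through sense clusters.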