A measure based on optimal matching in graph theory for document similarity

Xiaojun Wan, Yuxin Peng
DOI: https://doi.org/10.1007/978-3-540-31871-2_20
2005-01-01
Abstract: Measuring pairwise document similarity is critical to various text retrieval and mining tasks. The most popular measure of document similarity is the cosine measure in the Vector Space Model. In this paper, we propose a new similarity measure based on optimal matching in graph theory. The proposed measure takes into account the structural information of a document by considering the word distributions over its text segments. It first calculates the similarity of each pair of text segments drawn from the two documents, and then obtains the overall similarity between the documents by computing an optimal matching over those segment pairs. We set up document similarity search experiments to test the effectiveness of the proposed measure. The experimental results and a user study demonstrate that the proposed measure outperforms the cosine measure.
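The two-step procedure described in the abstract (segment-pair similarities, then an optimal matching) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes that documents arrive already split into segments, that segment similarity is the cosine of raw term-frequency vectors, and that the matched total is normalised by the smaller segment count. The function names are hypothetical. For brevity the optimal matching is found by exhaustive search over assignments, which is only practical for a handful of segments; a polynomial-time assignment algorithm such as Kuhn-Munkres would be the usual choice at scale.

```python
import math
from collections import Counter
from itertools import permutations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_vectors(segments):
    """Turn each text segment into a bag-of-words term-frequency vector."""
    return [Counter(s.lower().split()) for s in segments]


def optimal_matching_similarity(doc_a, doc_b):
    """Similarity of two documents, each given as a list of text segments.

    Builds a weighted bipartite graph (segments of doc_a vs. segments of
    doc_b, edge weight = cosine similarity) and takes the maximum-weight
    one-to-one matching.
    """
    va, vb = segment_vectors(doc_a), segment_vectors(doc_b)
    if not va or not vb:
        return 0.0
    # Match every segment of the shorter document to a distinct segment
    # of the longer one.
    small, large = (va, vb) if len(va) <= len(vb) else (vb, va)
    weights = [[cosine(s, l) for l in large] for s in small]
    # Exhaustive search over all one-to-one assignments (assumption:
    # segment counts are small enough for this to be feasible).
    best = max(
        sum(weights[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(len(large)), len(small))
    )
    # Assumption: normalise by the smaller segment count so that the
    # score stays in [0, 1].
    return best / len(small)
```

With this sketch, two documents containing the same segments in a different order score close to 1, because the matching pairs each segment with its counterpart regardless of position, while documents with no shared terms score 0.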