Efficient Near-Duplicate Detection for Q&A Forum.

Yan Wu, Qi Zhang, Xuanjing Huang
2011-01-01
Abstract: This paper addresses the problem of redundant data in large-scale Q&A forum collections. We propose and evaluate a novel algorithm for automatically detecting near-duplicate Q&A threads. The main idea is to use a distributed index and the Map-Reduce framework to compute pairwise similarity, so that redundant data can be identified quickly and at scale. The proposed method was evaluated on a real-world data collection crawled from a popular Q&A forum. Experimental results show that the method detects near-duplicate content in large web collections both effectively and efficiently.
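The abstract does not give the algorithm's details, so the following is only a minimal single-process sketch of the general idea of index-based pairwise similarity, not the authors' method. It assumes whitespace tokenization, Jaccard similarity over token sets, and a threshold of 0.8; the function names (`tokenize`, `build_index`, `count_shared_terms`, `near_duplicates`) are hypothetical. The two grouping steps stand in for what would be two Map-Reduce jobs in a distributed setting.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase and split on whitespace; returns a set of unique tokens."""
    return set(text.lower().split())


def build_index(threads):
    """Step 1 (map + reduce): emit (term, thread_id) pairs and group them by term."""
    index = defaultdict(set)
    for thread_id, text in threads.items():
        for term in tokenize(text):
            index[term].add(thread_id)
    return index


def count_shared_terms(index):
    """Step 2 (map + reduce): emit a 1 for each pair in a postings list, then sum per pair."""
    shared = defaultdict(int)
    for postings in index.values():
        for a, b in combinations(sorted(postings), 2):
            shared[(a, b)] += 1
    return shared


def near_duplicates(threads, threshold=0.8):
    """Return {(id_a, id_b): jaccard} for pairs at or above the assumed threshold."""
    sizes = {tid: len(tokenize(text)) for tid, text in threads.items()}
    shared = count_shared_terms(build_index(threads))
    result = {}
    for (a, b), inter in shared.items():
        # Jaccard = |A ∩ B| / |A ∪ B|, with the union derived from the set sizes.
        jaccard = inter / (sizes[a] + sizes[b] - inter)
        if jaccard >= threshold:
            result[(a, b)] = jaccard
    return result


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    threads = {
        "t1": "how do i reset my router password",
        "t2": "how do i reset my router password quickly",
        "t3": "best pizza in town",
    }
    print(near_duplicates(threads))
```

Only thread pairs that share at least one term are ever compared, which is what makes the index-based formulation cheaper than scoring all pairs directly. In practice very common terms produce long postings lists and would likely need to be pruned.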