Sentence Near-Duplicate Detection Based on Low-IDF-SIG

Xuanjing Huang
2011-01-01
Abstract: Because of the explosive growth of the Internet, enormous amounts of duplicated data cause serious problems for search engines, opinion mining, and many other Web applications. Most existing near-duplicate detection approaches work at the document level and are incapable of finding a duplicated part that is only a small piece of both documents. Near-duplicate detection at the sentence level is a key solution to this problem. This paper proposes an effective and efficient feature extraction algorithm named Low-IDF-Sig. To represent a given sentence, the algorithm extracts an improved Shingle feature according to selected antecedents. Experimental results on a real corpus show that the proposed method improves both the precision and the efficiency of near-duplicate detection at the sentence level.
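The abstract does not give the details of Low-IDF-Sig, so the following is only a minimal sketch of one plausible reading: the lowest-IDF tokens of a sentence serve as anchors, and a short shingle starting at each anchor becomes a signature. The function names, the whitespace tokenization, the parameters `k` and `width`, and the rule that two sentences match when they share any signature are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def idf_table(sentences):
    """Compute IDF per token, treating each sentence as one document."""
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(s.split()))  # count each token once per sentence
    return {w: math.log(n / c) for w, c in df.items()}


def low_idf_signatures(sentence, idf, k=2, width=3):
    """Return shingles of up to `width` tokens starting at the k lowest-IDF positions.

    Hypothetical helper: the anchor selection rule is an assumption.
    """
    tokens = sentence.split()
    # Sort positions by ascending IDF, breaking ties by position.
    # Tokens missing from the table get infinite IDF, so they rank last.
    positions = sorted(
        range(len(tokens)),
        key=lambda i: (idf.get(tokens[i], math.inf), i),
    )
    return {" ".join(tokens[i:i + width]) for i in positions[:k]}


def is_near_duplicate(a, b, idf, k=2, width=3):
    """Assumed matching rule: two sentences match if they share a signature."""
    sig_a = low_idf_signatures(a, idf, k, width)
    sig_b = low_idf_signatures(b, idf, k, width)
    return bool(sig_a & sig_b)


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
    "a stock market report was published in the morning",
]
idf = idf_table(corpus)
print(is_near_duplicate(corpus[0], corpus[1], idf))
print(is_near_duplicate(corpus[0], corpus[2], idf))
```

Anchoring on frequent (low-IDF) tokens keeps the number of signatures per sentence small and makes it likely that two lightly edited copies of a sentence pick the same anchors, which is the intuition this sketch tries to capture.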