Near-duplicate Document Detection with Improved Similarity Measurement

Xin-pan Yuan, Jun Long, Zu-ping Zhang, Wei-hua Gui
DOI: https://doi.org/10.1007/s11771-012-1267-z
2012-01-01
Abstract: To quickly find highly similar documents in an existing document collection, a fingerprint group merging retrieval algorithm is proposed. It addresses two sides of the problem: the similarity threshold cannot be set too low, and using fewer fingerprints leads to low accuracy. It can be proved that the fingerprint group merging retrieval algorithm improves the efficiency of similarity retrieval at lower similarity thresholds. Experiments with a lower similarity threshold r = 0.7 and a high number of fingerprint bits k = 400 show that CPU time decreases from 1921 s to 273 s. Theoretical analysis and experimental results verify the effectiveness of the method.