Efficient Near-duplicate Detection Based on Multiple SimHash Fingerprints

DONG Bo, ZHENG Qing-hua, SONG Kai-lei, TIAN Feng, MA Rui
2011-01-01
Abstract: Near-duplicate detection has attracted significant attention in recent years. SimHash-based near-duplicate detection is one of the state-of-the-art approaches. However, the method has a drawback: SimHash maps high-dimensional vectors to small, fixed-length fingerprints, and this mapping loses a certain amount of information. To address the problem, this paper first analyzes the statistical characteristics of term sets. It then presents a novel and efficient near-duplicate detection scheme based on multiple SimHash fingerprints and k-dimensional hypersurfaces. Experimental results show that the scheme significantly improves precision and F1 while leaving execution time almost unchanged.
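
For readers unfamiliar with the baseline, the following is a minimal sketch of a generic single-fingerprint SimHash, showing how a weighted term vector is reduced to a fixed-length fingerprint and why information is lost in the process. It is not the multi-fingerprint scheme proposed in the paper, whose details the abstract does not give. The 64-bit width, the MD5-based term hash, whitespace tokenization, and term-frequency weights are all assumptions chosen for illustration, as are the function names.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def term_hash(term: str) -> int:
    """Hash a term to a 64-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Collapse a bag of terms into one fixed-length fingerprint."""
    # Term frequency serves as the weight of each dimension (assumption).
    weights = Counter(text.lower().split())
    totals = [0] * FINGERPRINT_BITS
    for term, weight in weights.items():
        h = term_hash(term)
        for bit in range(FINGERPRINT_BITS):
            # A 1-bit votes +weight for this position, a 0-bit votes -weight.
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    # Only the sign of each total survives; the magnitudes are discarded,
    # which is where the information loss occurs.
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

In a typical SimHash pipeline, two documents are treated as near-duplicates when the Hamming distance between their fingerprints falls below a small threshold; the threshold value is application-specific and is not given in the abstract.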