DarkDiff: Explainable web page similarity of TOR onion sites

Pieter Hartel, Eljo Haspels, Mark van Staalduinen, Octavio Texeira
2023-08-23
Abstract: Near-duplicates are a common problem in large-scale data analysis. For example, when two phishing emails are near-duplicates, a difference in the salutation (Mr versus Ms) is not essential, but whether the bank is A or B is important. The state of the art in near-duplicate detection, MinHash, is a black-box approach: it reveals that two emails are near-duplicates, but not why. We present DarkDiff, which detects near-duplicates efficiently and also reports the reason why two items are near-duplicates. We developed DarkDiff to detect near-duplicate homepages on the Darkweb. DarkDiff works well on those pages because they resemble the clear web of the past.
Cryptography and Security
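
The contrast drawn in the abstract, between a similarity score that only says "near-duplicate" and a comparison that says why, can be illustrated with a minimal sketch. This is not DarkDiff's algorithm, which the abstract does not describe. The sketch assumes whitespace tokenization, uses the exact Jaccard similarity (the quantity that MinHash estimates) as the black-box score, and uses Python's standard difflib module as a stand-in for an explainable comparison. The function names and the two example emails are hypothetical.

```python
import difflib


def jaccard(a, b):
    """Exact Jaccard similarity of two token lists (MinHash estimates this value)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def explain_difference(a, b):
    """Return the non-matching spans between two token lists.

    Each entry is (operation, tokens_from_a, tokens_from_b), where the
    operation is 'replace', 'delete', or 'insert' as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


# Hypothetical phishing emails modeled on the example in the abstract.
email_a = "Dear Mr Smith your account at Bank A has been suspended".split()
email_b = "Dear Ms Smith your account at Bank B has been suspended".split()

score = jaccard(email_a, email_b)
changes = explain_difference(email_a, email_b)

print(f"similarity score: {score:.3f}")
for operation, old, new in changes:
    print(operation, old, "->", new)
```

The score alone gives both differences the same weight and names neither. The list of changes shows that one difference is the salutation and the other is the bank, so an analyst or a downstream rule can decide which one matters.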