Abstract: To find near-duplicate documents, fingerprint-based paradigms such as Broder's shingling and Charikar's simhash algorithms have been recognized as effective approaches and are considered the state of the art. Nevertheless, we see two aspects of these approaches that may be improved. First, a high score under these algorithms' similarity measures implies a high probability that two documents are similar, which is not the same as a high degree of similarity between them; yet how similar two documents are is what we really need to know. Second, fingerprint paradigms require a tradeoff between hash-code length and hash-code multiplicity, which makes it hard to maintain a satisfactory recall level while improving precision. In this paper our contributions are twofold. First, we propose a framework for implementing the longest common subsequence (LCS) as a similarity measure in reasonable computing time, which leads to both high precision and high recall. Second, we present an algorithm that derives a trustworthy partition from the LCS to reduce the negative impact of templates used in web page design. A comprehensive experiment was conducted to evaluate our method in terms of effectiveness, efficiency, and quality of results. Specifically, the method was successfully used to partition a set of 430 million web pages into 68 million subsets of similar pages, which demonstrates its effectiveness. For quality, we compared our method with simhash and a Cosine-based method through a sampling process (Cosine is compared to LCS as an alternative similarity measure). The results showed that our algorithm reached an overall precision of 0.95, while simhash reached 0.71 and Cosine 0.82. At the same time, our method obtained 1.86 times the recall of simhash and 1.56 times the recall of Cosine. A comparison experiment was also conducted for documents within the same web sites. There, our algorithm, simhash, and Cosine found almost the same number of true positives at precisions of 0.91, 0.50, and 0.63, respectively. In terms of efficiency, our algorithm took 118 hours to process the whole archive of 430 million topic-type pages on a cluster of six Linux boxes, while simhash and Cosine took 94 hours and 68 hours, respectively. When word segmentation is required, as for languages such as Chinese, the processing time of Cosine increases substantially; in our experiment it was 602 hours.
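To make the idea of LCS as a similarity measure concrete, the following is a minimal sketch and not the framework proposed in the paper. It uses the textbook dynamic-programming LCS over whitespace-separated tokens. The function names `lcs_length` and `lcs_similarity`, the whitespace tokenization, and the normalization by the longer document's length are all assumptions made for illustration; the abstract does not specify any of them.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Classic O(len(a) * len(b)) dynamic program, keeping only one
    previous row so memory is proportional to the shorter sequence.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer one, store rows for the shorter
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(doc_a, doc_b):
    """Hypothetical similarity score in [0, 1].

    Assumption: tokens are whitespace-separated words and the LCS length
    is normalized by the longer document's token count.
    """
    a, b = doc_a.split(), doc_b.split()
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    print(lcs_similarity("the quick brown fox", "the quick red fox"))
```

Unlike a fingerprint match, this score directly reflects how much of the two token sequences coincide in order. The quadratic cost per pair is the reason a framework for running LCS in reasonable computing time is needed at the scale of hundreds of millions of pages.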

Text Deduplication with Minimum Loss Ratio.

Research on method to detect reduplicative Chinese short texts

DEDUCT: A Secure Deduplication of Textual Data in Cloud Environments

Data Deduplication with Random Substitutions

Sizespotsigs: An Effective Deduplicate Algorithm Considering The Size Of Page Content

Efficient Partial-Duplicate Detection Based on Sequence Matching

A binary-tree based algorithm for online duplicate documents detection

SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs

Duplicate Web Page Elimination Based on HTML and Extraction of Long Sentence

LSHBloom: Memory-efficient, Extreme-scale Document Deduplication

Chunk Content is not Enough: Chunk-Context Aware Resemblance Detection for Deduplication Delta Compression

FuzzyDedup: Secure Fuzzy Deduplication for Cloud Storage

Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models

The Design of a Lossless Deduplication Scheme to Eliminate Fine-grained Redundancy for JPEG Image Storage Systems

Double sliding window chunking algorithm for data deduplication in ocean observation

Deduplicating Training Data Makes Language Models Better

Reducing Data Fragmentation in Data Deduplication Systems via Partial Repetition and Coding

Improved fuzzy set information retrieval approach on duplicate webpage detection

Achieving Both High Precision and High Recall in Near-Duplicate Detection.

Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research