Text Deduplication with Minimum Loss Ratio.

Youming Ge, Jiefeng Wu, Genan Dai, Yubao Liu
DOI: https://doi.org/10.1145/3318299.3318369
2019-01-01
Abstract: Text deduplication is an important operation in text document analysis applications. Given a set of text documents, we often need to remove those whose similarity values are not less than a specified threshold. However, if the set of similar documents to be removed is too large, the remaining set may not be enough for text analysis. In this paper, we consider the problem of how to balance the removed set and the remaining set of text documents. We try to reduce the duplicated information as much as possible while removing the minimum number of text documents. We propose a greedy algorithm for this problem based on the concept of a similarity graph, which represents the similarity relationships among a set of text documents. We also consider an incremental algorithm for dynamic settings. Experimental results on real news document datasets show the efficiency of the proposed algorithms.
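To make the similarity-graph idea concrete, here is a minimal Python sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes Jaccard similarity over whitespace tokens, a threshold of 0.5, and a simple highest-degree-first removal rule; the function names jaccard, build_similarity_graph, and greedy_dedup are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_similarity_graph(docs, threshold):
    """Connect documents i and j when their similarity is >= threshold."""
    tokens = [set(d.lower().split()) for d in docs]
    graph = {i: set() for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_dedup(docs, threshold=0.5):
    """Assumed heuristic, not the paper's method: repeatedly drop the document
    with the most remaining similar neighbours until no similar pair is left."""
    graph = build_similarity_graph(docs, threshold)
    removed = []
    while True:
        # Pick the highest-degree vertex; ties go to the lowest index.
        node = max(graph, key=lambda n: (len(graph[n]), -n), default=None)
        if node is None or not graph[node]:
            break  # no edges remain, so no similar pair is left
        for neighbour in graph.pop(node):
            graph[neighbour].discard(node)
        removed.append(node)
    return sorted(graph), removed


if __name__ == "__main__":
    docs = ["a b c d", "a b c e", "a b c f", "x y z"]
    kept, removed = greedy_dedup(docs, threshold=0.5)
    print("kept:", kept, "removed:", removed)
```

Removing the highest-degree document first is one plausible way to eliminate many similar pairs with few removals. An incremental variant for dynamic settings would need to update the graph as documents arrive, which this sketch does not attempt.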