Compound Method Based on Frequent Terms for Near Duplicate Documents Detection

Gaudence Uwamahoro, Zuping Zhang, Ambele Robert Mtafya, Jun Long
DOI: https://doi.org/10.14257/ijdta.2014.7.6.05
2014-01-01
International Journal of Database Theory and Application
Abstract: Examining data to find similar items is a major problem in data mining and information retrieval. Documents that contain information are abundant. Most of these documents are duplicates or near duplicates, which increases storage requirements and the time needed to search for information. Reducing dimensionality and organizing data well are ways to address this efficiency problem. In this paper we propose a method that mines frequent terms from each document to reduce the data size, together with an efficient method for clustering documents that are closely similar. With our method, only 36.4% of the original size is used. The similarity between documents is based on the frequent terms they share. Our method runs in O(n) time, whereas current clustering methods require O(n^3).
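The idea of comparing documents through their shared frequent terms can be illustrated with a minimal sketch. The abstract does not give the mining threshold or the similarity formula, so the `min_count` parameter, the simple tokenizer, and the Jaccard-style overlap below are assumptions made for illustration only and are not the authors' implementation.

```python
import re
from collections import Counter


def frequent_terms(text, min_count=2):
    """Return the set of terms that occur at least min_count times in text.

    min_count is a hypothetical threshold; the abstract does not state one.
    """
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return {term for term, n in counts.items() if n >= min_count}


def shared_term_similarity(terms_a, terms_b):
    """Jaccard overlap of two frequent-term sets (an assumed measure)."""
    if not terms_a and not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


doc_a = "data mining finds data patterns; mining data is useful"
doc_b = "mining data and data mining again"
doc_c = "cats chase cats while dogs chase dogs"

terms_a = frequent_terms(doc_a)
terms_b = frequent_terms(doc_b)
terms_c = frequent_terms(doc_c)

print(shared_term_similarity(terms_a, terms_b))
print(shared_term_similarity(terms_a, terms_c))
```

Each document is reduced to a small set of frequent terms before any comparison, which is where the reduction in data size would come from. Documents whose sets overlap heavily are candidates for the same near-duplicate cluster.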