A Text Similarity Simulation Detection Model Based on Hadoop

Yun WU, Kangzhen XU, Ruizhang HUANG
DOI: https://doi.org/10.13568/j.cnki.651094.2017.03.010
2017-01-01
Abstract: With the growth of data in the information age, traditional text similarity methods can no longer handle large-scale text data. To address this problem, this paper proposes a text similarity simulation detection model based on Hadoop cluster technology. The model works in three steps. First, Hadoop is used to build the experimental platform, and the platform's hardware and software are optimized. Second, each document is converted into a set using an improved Shingling algorithm based on the MapReduce programming model. Third, a distributed NewMinhash algorithm is proposed to compute the signature matrix, and Jaccard coefficients are then used to calculate the similarity. Experiments show that, for the same operation, the optimized platform reduces running time by nearly 5.65%. The model not only measures text similarity more accurately but also adapts better to distributed processing of large-scale text data, and it scales well.
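The pipeline in the abstract (shingling, a MinHash signature, then a Jaccard estimate) can be illustrated on a single machine. The following is a minimal sketch of the standard techniques only: it is not the paper's improved Shingling or NewMinhash algorithm, it does not use Hadoop or MapReduce, and every function name and parameter (`shingles`, `make_hash_params`, `minhash_signature`, `k`, `seed`) is an assumption made for illustration.

```python
import random
import zlib

# A prime slightly larger than 2**32, so every CRC32 value fits below it.
PRIME = 4294967311


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalized text."""
    text = " ".join(text.split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(n, seed=0):
    """Draw n (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(shingle_set, params):
    """Return one column of the signature matrix: the minimum of each hash
    function over the integer ids of the document's shingles."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Estimate the Jaccard coefficient as the fraction of matching rows."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "traditional methods cannot handle large-scale text data"
    doc_b = "traditional methods cannot process large-scale text data"
    params = make_hash_params(200)
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    sig_a = minhash_signature(set_a, params)
    sig_b = minhash_signature(set_b, params)
    print("exact Jaccard:    ", jaccard(set_a, set_b))
    print("MinHash estimate: ", estimate_similarity(sig_a, sig_b))
```

Because each document's shingle set and signature depend only on that document, one plausible way to distribute this work is to compute them in a map step and compare signatures afterwards. That split is an assumption here, not a description of the paper's design.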