Research on Parallel Duplicated Webpage Deletion Based on MapReduce Model

Shu Gao
2010-01-01
Abstract: To reduce the difficulty of writing parallel programs for large-scale text data processing in search engine development, an algorithm based on Google's MapReduce model was implemented. By analyzing parallel data processing in a search engine system, the original MapReduce model was extended to address a shortcoming in how reducers are scheduled over the results generated by the Mapper stage during duplicated webpage deletion, a preprocessing step in the search engine system. Finally, the actual efficiency of the extended MapReduce model in a parallel data processing system was analyzed on the basis of a practical application.
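
The abstract does not give the algorithm's details, so the following is only a minimal single-process sketch of how duplicate webpage deletion can be phrased in the standard map / shuffle / reduce pattern. It is not the paper's extended model and does not reproduce its reducer scheduling. All names (`map_page`, `shuffle`, `reduce_group`, `deduplicate`) are hypothetical. The use of a SHA-1 hash of whitespace-normalized, lowercased text as the duplicate key is an assumption made for illustration, as is the rule of keeping the smallest URL in each group.

```python
import hashlib
from collections import defaultdict


def map_page(url, text):
    """Map step: emit (fingerprint, url) for one page.

    Assumption: two pages count as duplicates when their lowercased,
    whitespace-normalized text is identical.
    """
    normalized = " ".join(text.lower().split())
    fingerprint = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return fingerprint, url


def shuffle(pairs):
    """Shuffle step: group all URLs that share the same fingerprint."""
    groups = defaultdict(list)
    for fingerprint, url in pairs:
        groups[fingerprint].append(url)
    return groups


def reduce_group(urls):
    """Reduce step: keep one representative URL per duplicate group.

    Assumption: the lexicographically smallest URL is kept.
    """
    return min(urls)


def deduplicate(pages):
    """Run map, shuffle, and reduce in sequence over a {url: text} dict."""
    pairs = [map_page(url, text) for url, text in pages.items()]
    groups = shuffle(pairs)
    return sorted(reduce_group(urls) for urls in groups.values())


if __name__ == "__main__":
    pages = {
        "http://a.example/1": "Hello   World",
        "http://b.example/2": "hello world",
        "http://c.example/3": "A different page",
    }
    for url in deduplicate(pages):
        print(url)
```

In a real MapReduce deployment the map and reduce functions would run on separate workers and the framework would perform the shuffle. Exact hashing also misses near-duplicates, which a production system would likely need to handle with a similarity-based fingerprint.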