Implementation of Parallel PageRank Algorithm Based on MapReduce

Yu PING, Yang XIANG, Bo ZHANG, Yin-fei HUANG
DOI: https://doi.org/10.3969/j.issn.1000-3428.2014.02.007
2014-01-01
Abstract: The emergence of distributed Web crawling has greatly expanded the scale of available Web information. Because PageRank must process the link topology of the entire set of existing pages, CPU, I/O, and memory limits become a major bottleneck when the data reaches the TB or PB level. To address these problems, this paper proposes a parallel PageRank algorithm based on MapReduce. In each iteration, the Map function processes the files containing the topology of the Web page graph, and the Reduce function calculates the pages' scores. The global Web page score is used as the convergence criterion to control the number of iterations and obtain a more precise page ranking. Experimental results show that the improved algorithm achieves better cluster performance and faster execution while preserving the overall ranking accuracy of the single-machine PageRank algorithm.
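The iteration described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation: it simulates the Map, shuffle, and Reduce phases in a single Python process, and the names `map_page`, `reduce_page`, and `pagerank`, the damping factor of 0.85, the handling of pages without out-links, and the convergence test (sum of absolute score changes across all pages below `tol`) are all assumptions made for illustration. It also assumes every link target appears as a key in the graph.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor; the abstract does not state one


def map_page(page, rank, out_links):
    """Map: re-emit the page's link list and split its rank among its out-links."""
    yield page, ("links", out_links)
    for target in out_links:
        yield target, ("rank", rank / len(out_links))


def reduce_page(values, num_pages, dangling_mass):
    """Reduce: sum the incoming contributions and apply the damping formula."""
    incoming = sum(payload for kind, payload in values if kind == "rank")
    # Rank held by pages without out-links is spread evenly (an assumption).
    incoming += dangling_mass / num_pages
    return (1 - DAMPING) / num_pages + DAMPING * incoming


def pagerank(graph, tol=1e-6, max_iter=200):
    """Repeat Map/Reduce rounds until the global score change drops below tol."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        dangling = sum(ranks[p] for p, links in graph.items() if not links)
        grouped = defaultdict(list)  # stands in for the shuffle phase
        for page, links in graph.items():
            for key, value in map_page(page, ranks[page], links):
                grouped[key].append(value)
        new_ranks = {p: reduce_page(grouped[p], n, dangling) for p in graph}
        # Assumed convergence criterion: total absolute change over all pages.
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in graph)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks, iteration


if __name__ == "__main__":
    # Hypothetical four-page graph: page -> list of pages it links to.
    demo = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    scores, rounds = pagerank(demo)
    for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

In a real MapReduce deployment each iteration would be a separate job reading the previous job's output, and the global change would typically be collected through a counter or a small side file; the loop above only mirrors that control flow.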