PDM: A Parallel Data Analysis System Based on Hadoop

DUAN Song-qing, WU Bin, YU Le, WANG Bai
DOI: https://doi.org/10.3969/j.issn.1674-2974.2012.10.015
2012-01-01
Abstract: A PDM (Parallel Data Mining) system was built on Hadoop. PDM contains a large number of parallel data analysis algorithms based on the MapReduce computational framework. These include not only classic algorithms for ETL, data mining, statistical analysis, and text analysis, but also SNA (social network analysis) algorithms based on graph mining. The principles and implementation of the parallel multiple linear regression algorithm and the multi-source shortest path algorithm are described, and the proposed message-passing model effectively addresses the difficulty MapReduce has in handling adjacency-matrix structures. This paper also illustrates some typical telecommunications applications, such as business recommendation based on parallel k-means and decision tree algorithms, and marketing key-point discovery based on the parallel PageRank algorithm. Finally, performance test results show that the proposed system is suitable for processing large-scale data efficiently.
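The abstract does not spell out how the message-passing model works, so the following is only a generic sketch of the idea, not the authors' implementation. It simulates MapReduce rounds in plain Python: each node carries its own adjacency list instead of a global adjacency matrix, mappers send candidate distances to neighbors as messages, and reducers keep the minimum per source. The function names (map_phase, reduce_phase, multi_source_shortest_paths), the record layout, and the assumption of non-negative edge weights are all illustrative choices.

```python
from collections import defaultdict
import math


def map_phase(node, state):
    """Emit the node's own record plus one distance message per neighbor."""
    neighbors, dist = state
    yield node, ("graph", neighbors)  # pass the adjacency list through
    yield node, ("dist", dist)        # keep distances already known
    for nbr, weight in neighbors.items():
        # Message: candidate distances to nbr from each source via this node.
        yield nbr, ("dist", {s: d + weight for s, d in dist.items()})


def reduce_phase(node, values):
    """Keep the adjacency list and the minimum distance seen per source."""
    neighbors, best = {}, {}
    for kind, payload in values:
        if kind == "graph":
            neighbors = payload
        else:
            for s, d in payload.items():
                if d < best.get(s, math.inf):
                    best[s] = d
    return neighbors, best


def multi_source_shortest_paths(graph, sources):
    """Iterate simulated MapReduce rounds until no distance changes.

    graph maps every node to a dict {neighbor: non-negative weight}.
    """
    # Initial state: each source knows distance 0 to itself.
    state = {n: (nbrs, {n: 0} if n in sources else {})
             for n, nbrs in graph.items()}
    while True:
        grouped = defaultdict(list)  # stands in for the shuffle step
        for node, st in state.items():
            for key, value in map_phase(node, st):
                grouped[key].append(value)
        new_state = {n: reduce_phase(n, vals) for n, vals in grouped.items()}
        if new_state == state:  # converged: no message improved anything
            return {n: dist for n, (_, dist) in new_state.items()}
        state = new_state
```

For example, calling multi_source_shortest_paths on a small directed graph with two sources returns, for every node, a dict of shortest distances keyed by source. In a real Hadoop job each loop iteration would be a separate MapReduce job, with convergence detected through a counter rather than an in-memory comparison.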