Parallelization of Classification Algorithms Based on SparkR

黄宜华,袁春风,刘志强,顾荣
DOI: https://doi.org/10.3778/j.issn.1673-9418.1503036
2015-01-01
Abstract: In recent years, parallelizing machine learning and data mining algorithms for big data has become an important research issue in the field of big data. Spark provides a programming interface called SparkR that allows data analysts in general application areas who are familiar with the R language to perform data analysis and computation on the Spark platform. This paper presents the design and implementation of several widely used parallel classification algorithms based on SparkR, including Multinomial Naive Bayes, SVM (support vector machine), and Logistic Regression. It also shows how to optimize the SVM and Logistic Regression algorithms to improve training speed, building on conventional parallel strategies. The experimental results show that the SparkR-based classification algorithms outperform Hadoop MapReduce, achieving an 8-fold speedup without losing scalability.