A Performance Evaluation of Classification Algorithms for Big Data.
Mo Hai, You Zhang, Youjin Zhang
DOI: https://doi.org/10.1016/j.procs.2017.11.479
2017-01-01
Procedia Computer Science
Abstract: The performance of two typical classification algorithms in Spark, random forest and naïve Bayes, is evaluated using four metrics: classification accuracy, speedup, scaleup, and sizeup. Experiments are performed on datasets and clusters of different scales. The results show that: (1) the accuracy of both algorithms is high; (2) speedup does not increase linearly with the number of nodes, and the number of nodes at which speedup is maximal differs with dataset size; (3) the scaleup of random forest peaks at 2 nodes and then decreases as the number of nodes increases; (4) for random forest, sizeup increases sharply with dataset size when the number of nodes is 2, and more slowly when the number of nodes is greater than 2; for naïve Bayes, sizeup increases with dataset size when the number of nodes is smaller than 6, whereas with 6 nodes and datasets larger than Sogou_5, sizeup changes little as dataset size grows.
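The abstract names speedup, scaleup, and sizeup without defining them. The sketch below assumes the definitions conventionally used in parallel data mining evaluations, which may differ in detail from those in the paper: speedup compares runtimes on the same dataset with 1 node and with m nodes; scaleup compares the base dataset on 1 node with an m-times larger dataset on m nodes; sizeup compares a k-times larger dataset with the base dataset on a fixed cluster. The function names and all timings are hypothetical, chosen only for illustration.

```python
def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Same dataset: runtime on 1 node divided by runtime on m nodes.

    Ideal (linear) speedup equals m.
    """
    return t_one_node / t_m_nodes


def scaleup(t_base: float, t_scaled: float) -> float:
    """Runtime for the base dataset on 1 node divided by the runtime
    for an m-times larger dataset on m nodes.

    Ideal scaleup equals 1.0.
    """
    return t_base / t_scaled


def sizeup(t_base_data: float, t_k_times_data: float) -> float:
    """Fixed cluster: runtime on a k-times larger dataset divided by
    the runtime on the base dataset.

    A value below k means the cost grows more slowly than the data.
    """
    return t_k_times_data / t_base_data


# Hypothetical runtimes in seconds (assumed values, not results from the paper).
example_speedup = speedup(t_one_node=120.0, t_m_nodes=40.0)
example_scaleup = scaleup(t_base=120.0, t_scaled=150.0)
example_sizeup = sizeup(t_base_data=40.0, t_k_times_data=100.0)

print(example_speedup, example_scaleup, example_sizeup)
```

Under these assumed definitions, a speedup curve that flattens or a scaleup that falls below 1.0 as nodes are added, as the abstract reports, would indicate that coordination overhead grows with cluster size.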