vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration
Kejiang Ye, Xiaohong Jiang, Yanzhang He, Xiang Li, Haiming Yan, Peng Huang
DOI: https://doi.org/10.1109/clusterw.2012.32
2012-01-01
Abstract: Big data processing is becoming increasingly important due to the continuous growth in the amount of data generated in fields such as particle physics, human genomics, and earth observation. However, the efficiency of processing large-scale data on modern virtual infrastructure, especially virtualized cloud computing infrastructure, is not clear. This paper focuses on the performance of Hadoop virtual clusters and proposes vHadoop, a scalable Hadoop virtual cluster platform for large-scale MapReduce-based parallel data processing. We first describe the design and implementation of the vHadoop platform. We then perform a series of experiments to investigate both its static and dynamic performance, including the performance characterization of a cross-domain Hadoop virtual cluster and the live migration of a Hadoop virtual cluster. After that, we use the vHadoop platform to run six typical parallel clustering algorithms (Canopy, Dirichlet, Fuzzy k-Means, k-Means, Mean Shift, and MinHash) on two typical datasets. Experimental results verify the efficiency of the vHadoop platform in processing MapReduce-based parallel machine learning applications.