Mechanisms of Optimizing MapReduce Framework on High Performance Computer

Jie Yu, Guangming Liu, Wei Hu, Wenrui Dong, Weiwei Zhang
DOI: https://doi.org/10.1109/HPCC.and.EUC.2013.104
2013-01-01
Abstract: With the amount of data growing exponentially, the industry faces an unprecedented challenge in processing massive data efficiently and reliably. High performance computers (HPC) have played a major role in big data processing because of their substantial computational power and very large storage capacity. However, some drawbacks still hinder efficient use of HPC, owing to its relatively low availability and usability. We propose implementing the MapReduce framework on HPC to address these problems and to broaden the application field of HPC. We design a workable plan to deploy Hadoop on HPC with a Lustre file system, and tune Lustre for better performance based on the nature of data access in Hadoop. A virtual memory disk is proposed to efficiently buffer temporary data and store intermediate data. By taking advantage of the high-speed interconnect of HPC, intermediate data can be transferred efficiently from map tasks to reduce tasks, which cannot be achieved in a Hadoop system on a server cluster, where the rate of data flow is bounded by the bandwidth of a low-speed network such as Ethernet. An evaluation driven by the standard benchmarks provided in the Hadoop package shows that, after applying the proposed optimizations, the Hadoop system on HPC outperforms the Hadoop system on a server cluster, especially when handling data-intensive applications.