ReCT: Improving MapReduce Performance under Failures with Resilient Checkpointing Tactics

Hao Wang, Haopeng Chen, Fei Hu
DOI: https://doi.org/10.1109/bigdata.2014.7004380
2014-01-01
Abstract: MapReduce is a programming paradigm that makes it simple and efficient to process vast amounts of data. It targets very large clusters, where failures are no longer exceptions. Fault tolerance is vital to MapReduce; however, its fault tolerance and recovery strategies perform poorly under failures. Currently, fault tolerance is implemented at the task level, so a task failure leads to re-execution of the whole task. In this work, we present ReCT, a family of resilient checkpointing tactics that substantially improves MapReduce performance under map task failures. ReCT introduces slight changes to the current MapReduce execution flow and makes it possible to create checkpoints beneath the task level. In case of task failures, ReCT tries to make the most of the finished parts of a task and skips them in retry attempts. The checkpointing tactics add little overhead and substantially accelerate the fault recovery process. We also observe that, under some circumstances, the new execution flow in ReCT involves far fewer I/O operations than that in Hadoop. ReCT outperforms Hadoop by 6.6% on average under no failures and by 4.6% to 51.0% under different failure densities.
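The idea of checkpointing beneath the task level and skipping finished parts on retry can be illustrated with a minimal sketch. This is not the ReCT implementation and does not use any Hadoop API: the names `run_map_task`, `InjectedFailure`, and `fail_once_at` are hypothetical, the split of a map task's input into "chunks" is an assumption made for illustration, and checkpoints are kept in memory rather than on disk.

```python
from typing import Callable, Dict, List, Optional, Tuple


class InjectedFailure(Exception):
    """Hypothetical stand-in for a map task failure."""


def run_map_task(
    chunks: List[List[str]],
    map_fn: Callable[[str], Tuple[str, int]],
    fail_once_at: Optional[int] = None,
) -> Tuple[List[Tuple[str, int]], int]:
    """Run a map task with per-chunk checkpoints and retry on failure.

    Returns the map output and the number of chunk executions performed
    across all attempts. `fail_once_at` injects a single failure just
    before the chunk with that index is processed (illustration only).
    """
    checkpoints: Dict[int, List[Tuple[str, int]]] = {}  # chunk index -> saved output
    chunk_runs = 0
    pending_failure = fail_once_at

    while len(checkpoints) < len(chunks):
        try:
            for idx, chunk in enumerate(chunks):
                if idx in checkpoints:
                    continue  # finished in an earlier attempt: skip it
                if idx == pending_failure:
                    pending_failure = None  # fail only once
                    raise InjectedFailure(f"attempt failed at chunk {idx}")
                chunk_runs += 1
                checkpoints[idx] = [map_fn(record) for record in chunk]
        except InjectedFailure:
            continue  # start a retry attempt; checkpoints are preserved

    # Assemble the final output in the original chunk order.
    output = [pair for idx in range(len(chunks)) for pair in checkpoints[idx]]
    return output, chunk_runs
```

With three chunks and one failure injected before the third, this sketch executes three chunks in total, whereas re-running the whole task after the failure would execute five (two before the failure, three in the retry).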