SP-Partitioner: A novel partition method to handle intermediate data skew in Spark Streaming
Guipeng Liu, Xiaomin Zhu, Ji Wang, Deke Guo, Weidong Bao, Hui Guo
DOI: https://doi.org/10.1016/j.future.2017.07.014
IF: 7.307
2018-09-01
Future Generation Computer Systems
Abstract: Spark Streaming, a popular tool for processing live data streams, offers a divide-and-conquer solution in which a data stream is split into batches that are processed in parallel by mappers, and the intermediate data from the mappers are then aggregated by reducers. However, a key issue with this approach to live data processing is partitioning skew, in which data are distributed unevenly over the processing units because the characteristics of incoming data streams are uncertain. This imbalance ripples through the mappers and becomes prominent at the reducers, making the reduce stage a performance bottleneck for the overall system. To address this issue, we present a partitioner, SP-Partitioner, that sits between the map and reduce stages to rebalance the workload of the reducers. In our design, we treat the batches that have already arrived as candidate samples and select samples by systematic sampling to predict the characteristics of the intermediate data. Based on the prediction, our method generates a reference table that guides the even allocation of the next batches of data. We implement SP-Partitioner in Spark 1.6.1 and evaluate its performance with widely used applications. Experimental results obtained on a real VM cluster show that our algorithms not only achieve better load balance on data with varying degrees of skew, but also decrease the average processing time per batch.
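
The sample-then-allocate idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the Spark API. The function names (systematic_sample, build_reference_table, assign_reducer, reducer_loads) are hypothetical, and the greedy "heaviest key to the lightest reducer" rule and the hash fallback for unseen keys are assumptions chosen for illustration, since the abstract does not specify how the reference table is built.

```python
from collections import Counter
import heapq
import zlib


def systematic_sample(records, step, start=0):
    """Systematic sampling: take every `step`-th record from index `start`."""
    return records[start::step]


def build_reference_table(sampled_keys, num_reducers):
    """Map each sampled key to a reducer.

    Assumption (not from the paper): keys are placed greedily, heaviest
    first, onto whichever reducer currently has the smallest estimated load.
    """
    counts = Counter(sampled_keys)
    heap = [(0, r) for r in range(num_reducers)]  # (estimated load, reducer id)
    heapq.heapify(heap)
    table = {}
    # Sort by descending frequency; break ties by key for determinism.
    for key, weight in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        load, reducer = heapq.heappop(heap)
        table[key] = reducer
        heapq.heappush(heap, (load + weight, reducer))
    return table


def assign_reducer(key, table, num_reducers):
    """Look the key up in the reference table; hash keys the sample missed."""
    if key in table:
        return table[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def reducer_loads(keys, table, num_reducers):
    """Count how many records each reducer would receive."""
    loads = [0] * num_reducers
    for key in keys:
        loads[assign_reducer(key, table, num_reducers)] += 1
    return loads


# A skewed batch: "a" appears twice as often as "b" or "c".
previous_batch = ["a", "b", "a", "c", "a", "b", "a", "c", "a", "b", "a", "c"]
sample = systematic_sample(previous_batch, step=3)
table = build_reference_table(sample, num_reducers=2)
loads = reducer_loads(previous_batch, table, num_reducers=2)
```

In this toy batch the hot key "a" gets one reducer to itself while "b" and "c" share the other, so each reducer receives six records. The sampling step needs care: a step that matches a periodic pattern in the stream (here, step=2 would sample only "a") gives a misleading estimate.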