DS<SUP>2</SUP>: Handling Data Skew Using Data Stealings over High-Speed Networks

Zeyu He, Zhifang Li, Xiaoshuang Peng, Chuliang Weng
DOI: https://doi.org/10.1109/ICDE51399.2021.00168
2021-01-01
Abstract: Distributed in-memory computing systems offer dramatic performance improvements over traditional disk-based systems, which has made them widely used in large-scale data processing applications. Unfortunately, the uneven and unpredictable data distributions caused by data skew have a significant impact on performance. In Spark, when data skew occurs, some tasks process much more data than others and become the performance bottleneck. Traditional approaches to handling data skew are based on sampling and repartitioning, which incur additional overhead. In this paper, we divide data skew in distributed data processing systems into intra-node and inter-node skew. Based on data stealing, we propose DS2 to handle both intra-node and inter-node data skew. It aims to improve performance under data skew without introducing additional overhead. DS2 first balances the skewed data distribution locally and then handles inter-node skew via RDMA during execution. It achieves up to a 2.96x speedup on the aggregation operator and a 2.81x speedup on the join operator.
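To illustrate why stealing can help under skew, the following is a minimal toy simulation and not the DS2 implementation. It assumes every record costs one time step, that stealing is free, and that any idle worker may take a record from the most loaded partition. It does not model RDMA, Spark, or the intra-node versus inter-node distinction. The function names `makespan_static` and `makespan_with_stealing` are hypothetical and chosen only for this sketch.

```python
def makespan_static(partitions):
    """Time steps needed when each task processes only its own partition.

    The most heavily loaded partition determines the finish time.
    """
    return max(partitions)


def makespan_with_stealing(partitions):
    """Time steps needed when idle workers steal records from the busiest partition.

    Assumption for this sketch: one record per worker per step, zero stealing cost.
    """
    remaining = list(partitions)
    steps = 0
    while any(r > 0 for r in remaining):
        for worker in range(len(remaining)):
            if remaining[worker] > 0:
                # The worker still has local data, so it processes its own record.
                target = worker
            else:
                # The worker is idle, so it steals from the most loaded partition.
                target = max(range(len(remaining)), key=remaining.__getitem__)
            if remaining[target] > 0:
                remaining[target] -= 1
        steps += 1
    return steps


skewed = [10, 2, 2, 2]  # one partition holds most of the records
print(makespan_static(skewed))         # 10
print(makespan_with_stealing(skewed))  # 4
```

Under these idealized assumptions the skewed run finishes in 4 steps instead of 10, because the three lightly loaded workers drain the large partition once their own data is exhausted. Real systems pay a cost for each steal, which is why the transfer mechanism matters.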