Efficient Data Blocking and Skipping Framework Applying Heuristic Rules

Yong Wang, Xiaochun Yun, Xi Wang, Shupeng Wang, Yongshang Wu
DOI: https://doi.org/10.1109/icpads.2017.00037
2017-01-01
Abstract: Data blocking is an effective data-skipping technique for reducing data access and shortening query response time in query engines. Given fine-grained, balanced blocks and corresponding metadata, a query can skip a block if the metadata indicates that the block contains no relevant data. The quality of a blocking strategy therefore depends on how well it can produce, in reasonable time, a data layout that is expected to skip most of the data. In this paper, we propose several algorithms that drastically reduce the time complexity of existing blocking strategies based on workload analysis, at the cost of a relatively small loss in the estimated number of tuples that can be skipped. Through theoretical analysis, we prove that the time complexity of our algorithms is lower than that of the Ward algorithm. We then present the complete blocking and skipping workflow, integrate it into Spark SQL, and evaluate it experimentally. The results show that our technique significantly improves blocking efficiency compared to the Ward algorithm while maintaining almost the same level of skipping ability.
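The block-skipping idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a single integer column, blocks formed by simple sorting rather than by workload analysis, and per-block min/max values as the metadata. The names `Block`, `build_blocks`, and `range_query` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    rows: List[int]  # values of one column stored in this block
    lo: int          # metadata (assumed form): minimum value in the block
    hi: int          # metadata (assumed form): maximum value in the block


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Sort the values and cut them into equal-sized blocks with min/max metadata."""
    ordered = sorted(values)
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append(Block(rows=chunk, lo=chunk[0], hi=chunk[-1]))
    return blocks


def range_query(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the values in [low, high] and the number of blocks skipped."""
    result: List[int] = []
    skipped = 0
    for block in blocks:
        # The metadata alone proves that no row in this block can match.
        if block.hi < low or block.lo > high:
            skipped += 1
            continue
        result.extend(v for v in block.rows if low <= v <= high)
    return result, skipped


blocks = build_blocks(list(range(100)), block_size=10)
matches, skipped = range_query(blocks, 25, 34)
print(len(blocks), skipped, matches)
```

In this toy layout the query touches only the two blocks whose value ranges overlap the predicate and skips the other eight. How many blocks can be skipped depends entirely on how rows are assigned to blocks, which is the cost that the blocking algorithms in the paper aim to reduce.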