Revisiting Performance in Big Data Systems

Chen Yang, Qi Guo, Xiaofeng Meng, Rihui Xin, Chunkai Wang
DOI: https://doi.org/10.1145/3127479.3132685
2017-01-01
Abstract: Big data systems for large-scale data processing are now in widespread use. To improve their performance, both academia and industry have put a great deal of effort into analyzing performance bottlenecks. Most big data systems, such as Hadoop and Spark, support distributed computing across clusters. As a result, their execution uses the CPU, memory, disk, and network in parallel. If a given resource has the greatest limiting impact on performance, the system is bottlenecked on it. For a system designer, tuning the bottleneck resource is an effective way to improve performance. The key question in this scenario is how to determine the bottleneck resource. The natural approach is to quantify the impact of the four major components and identify the one with the greatest impact factor as the bottleneck resource.
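The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the impact factor, so this sketch assumes one possible definition: the relative reduction in job runtime when a single resource is idealized. The function names `impact_factors` and `bottleneck` and all timing values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def impact_factors(baseline_s, idealized_s):
    """Return an assumed impact factor per resource.

    baseline_s: measured job runtime in seconds.
    idealized_s: maps resource name -> runtime in seconds when that
    resource alone is idealized (an assumption, not the paper's metric).
    """
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    # Relative runtime reduction: larger means the resource limits more.
    return {res: (baseline_s - t) / baseline_s for res, t in idealized_s.items()}


def bottleneck(factors):
    """Pick the resource with the greatest impact factor."""
    return max(factors, key=factors.get)


# Made-up runtimes for the four components named in the abstract.
baseline_s = 100.0
idealized_s = {"cpu": 60.0, "memory": 95.0, "disk": 80.0, "network": 90.0}

factors = impact_factors(baseline_s, idealized_s)
print(bottleneck(factors))
```

With these made-up numbers, the CPU shows the largest relative reduction, so it would be reported as the bottleneck resource under this assumed metric.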