VarCatcher: A Framework for Tackling Performance Variability of Parallel Workloads on Multi-Core

Weihua Zhang, Xiaofeng Ji, Bo Song, Shiqiang Yu, Haibo Chen, Tao Li, Pen-Chung Yew, Wenyun Zhao
DOI: https://doi.org/10.1109/tpds.2016.2613524
IF: 5.3
2017-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: The non-deterministic nature of multi-threaded workloads running on multi-core platforms often leads to notable performance variability from run to run. Such variability makes experimental results prone to misinterpretation or misguided claims. To deal with it, statistical inference methods are usually used to summarize experimental results at a given confidence level by repeating the experiments or measurements a large number of times. However, such statistical results are often too vague or too simplistic. They are not sufficient to help users understand the causes of the variability, to allow more in-depth analysis of the results, or to reproduce the results for validation during design space exploration. To allow better analyzability and reproducibility, we propose a framework to tackle such variability, called VarCatcher. The key to VarCatcher is to characterize a parallel execution using a Parallel Characteristics Vector (PCV). A clustering-based approach is then used to group runs with similar execution characteristics; these groups can later be used to analyze results in depth, to customize evaluation strategies, to reproduce a result in the presence of variability, to determine the impact of features, or to assist performance diagnosis. We have built a prototype of VarCatcher that includes a user-level toolset for runtime monitoring and measurement using the Intel Processor Trace feature on commodity Intel processors, as well as an architecture extension, with very low runtime overheads (around 3 and 0.01 percent, respectively). Several case studies confirm that VarCatcher enables appealing features such as in-depth result analysis, customized evaluation strategies, and reproducibility.
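The grouping step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify what a PCV contains or which clustering algorithm VarCatcher uses, so everything here is an assumption: each run is reduced to a hypothetical two-element feature vector, and a simple greedy "leader" clustering with a Euclidean distance threshold stands in for the paper's method. The function names `cluster_runs` and `summarize`, the sample vectors, and the runtimes are all invented for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def cluster_runs(vectors, threshold):
    """Greedy leader clustering (a stand-in, not the paper's algorithm).

    Each run joins the first cluster whose leader (its first member) lies
    within `threshold`; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of run indices.
    """
    clusters = []
    for i, v in enumerate(vectors):
        for members in clusters:
            if dist(vectors[members[0]], v) <= threshold:
                members.append(i)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            clusters.append([i])
    return clusters


def summarize(clusters, runtimes):
    """Return (run count, mean runtime) for each cluster."""
    return [(len(c), sum(runtimes[i] for i in c) / len(c)) for c in clusters]


# Hypothetical per-run feature vectors and measured runtimes (seconds).
vectors = [(0.10, 0.90), (0.12, 0.88), (0.80, 0.20), (0.11, 0.91), (0.79, 0.22)]
runtimes = [10.0, 10.2, 14.0, 10.1, 14.2]

clusters = cluster_runs(vectors, threshold=0.1)
summary = summarize(clusters, runtimes)

for members, (count, mean_rt) in zip(clusters, summary):
    print(f"runs {members}: n={count}, mean runtime={mean_rt:.2f}s")
```

Reporting a per-cluster count and mean, rather than one mean over all runs, is the kind of in-depth view the abstract argues for: a single average would hide the fact that the runs fall into distinct behavioral groups.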