Two-Level Hybrid Sampled Simulation of Multithreaded Applications

Chuntao Jiang, Zhibin Yu, Lieven Eeckhout, Hai Jin, Xiaofei Liao, Cheng-Zhong Xu
DOI: https://doi.org/10.1145/2818353
IF: 1.444
2016-01-01
ACM Transactions on Architecture and Code Optimization
Abstract: Sampled microarchitectural simulation of single-threaded applications has been a mature technology for over a decade. Sampling multithreaded applications, on the other hand, is much more complicated. Not until very recently have researchers proposed solutions for sampled simulation of multithreaded applications. Time-Based Sampling (TBS) samples multithreaded application execution based on time (not instructions, as is typically done for single-threaded applications), yielding estimates for a multithreaded application’s execution time. In this article, we revisit and analyze previously proposed TBS approaches (periodic and Cantor fractal based sampling), and we obtain a number of novel and surprising insights, such as (i) accurately estimating fast-forwarding IPC, that is, performance in between sampling units, is more important than accurately estimating sample IPC, that is, performance within the sampling units; (ii) fast-forwarding IPC estimation accuracy is determined by both the sampling unit distribution and how the sampling units are used to predict fast-forwarding IPC; and (iii) Cantor sampling is more accurate at small sampling unit sizes, whereas periodic sampling is more accurate at large sampling unit sizes. These insights lead to the development of Two-level Hybrid Sampling (THS), a novel sampling methodology for multithreaded applications that combines periodic sampling’s accuracy at large time scales (i.e., uniformly selecting coarse-grain sampling units across the entire program execution) with Cantor sampling’s accuracy at small time scales (i.e., the ability to accurately predict fast-forwarding IPC in between small sampling units). The clustered occurrence of small sampling units under Cantor sampling also enables shortened warmup and thus enhanced simulation speed.
Overall, THS achieves an average absolute execution time prediction error of 4% while yielding an average simulation speedup of 40× compared to detailed simulation, which is both more accurate and faster than the current state of the art. Case studies illustrate THS’ ability to accurately predict relative performance differences across the design space.
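The two-level idea described in the abstract can be illustrated with a minimal sketch. It assumes that time is measured in abstract integer ticks, that Cantor sampling places units on a classic middle-third Cantor construction, and that the hybrid scheme applies this construction inside each periodically chosen coarse unit. The function names, parameters, and placement rules are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# A sampling unit is (start_tick, length_in_ticks). All names here are
# hypothetical; they illustrate the idea only, not the authors' method.
Unit = Tuple[int, int]


def periodic_units(total: int, period: int, unit: int) -> List[Unit]:
    """Place one unit of length `unit` at the start of every `period` ticks."""
    return [(start, unit) for start in range(0, total - unit + 1, period)]


def cantor_units(start: int, length: int, depth: int) -> List[Unit]:
    """Middle-third Cantor construction: keep the outer thirds, recurse.

    Assumes `length` is divisible by 3**depth so integer division is exact.
    """
    if depth == 0:
        return [(start, length)]
    third = length // 3
    left = cantor_units(start, third, depth - 1)
    right = cantor_units(start + 2 * third, third, depth - 1)
    return left + right


def hybrid_units(total: int, period: int, coarse: int, depth: int) -> List[Unit]:
    """Two-level sketch: periodic coarse units, Cantor-style units inside each."""
    units: List[Unit] = []
    for coarse_start, coarse_len in periodic_units(total, period, coarse):
        units.extend(cantor_units(coarse_start, coarse_len, depth))
    return units


if __name__ == "__main__":
    # Coarse units of 9 ticks every 27 ticks, each split by one Cantor step.
    for start, length in hybrid_units(total=54, period=27, coarse=9, depth=1):
        print(f"detailed simulation from tick {start} for {length} ticks")
```

In this sketch the small units cluster inside each coarse unit, which mirrors the clustering that the abstract credits with shortening warmup; how fast-forwarding IPC is predicted from these units is not specified in the abstract and is left out here.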