HyFiSS: A Hybrid Fidelity Stall-Aware Simulator for GPGPUs

Jianchao Yang, Mei Wen, Dong Chen, Zhaoyun Chen, Zeyu Xue, Yuhang Li, Junzhong Shen, Yang Shi
DOI: https://doi.org/10.1109/micro61859.2024.00022
2024-01-01
Abstract: The widespread adoption of GPUs has driven the development of GPU simulators, which, in turn, drive advancements in both GPU architectures and software optimization. Trace-driven cycle-accurate simulators, which provide detailed microarchitectural models and clock-level precision, come at the cost of extended simulation times and high computational resource requirements. Their scalability has become a bottleneck. A growing trend is the adoption of cycle-approximate simulators, which introduce mathematical modeling of some hardware units and use sampling to accelerate simulation. However, this approach faces challenges regarding the accuracy of performance predictions. To address these limitations, we introduce HyFiSS, a hybrid fidelity stall-aware GPU simulator. HyFiSS features fine-grained stall-event tracking and attribution by constructing a detailed execution pipeline model for various stall events on Streaming Multiprocessors (SMs). It accurately emulates thread block scheduler behavior using real-time scheduling logs and uses sampling based on thread block sets to minimize the precision loss due to fine-grained sampling points on the microarchitectural state. We achieve a balance between reliability, speed, and the level of simulation detail, especially regarding bottlenecks. Evaluated on a diverse set of benchmarks, HyFiSS achieves a mean absolute percentage error in predicting active cycles that is comparable to that of the state-of-the-art cycle-accurate simulator Accel-Sim. Moreover, HyFiSS achieves a substantial 12.8× speedup in simulation efficiency compared to Accel-Sim. HyFiSS also requires at least 3.2× less disk storage than both Accel-Sim and another state-of-the-art cycle-approximate simulator, PPT-GPU, due to its efficient compression of SASS (Streaming Assembler) traces. With precise, per-cycle stall-event statistics, HyFiSS can provide accurate GPU performance metrics and stall cause reporting. This significantly simplifies performance analysis, bottleneck identification, and performance optimization tasks for researchers, making it easier to enhance GPU performance effectively.
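The accuracy claim above is stated as a mean absolute percentage error (MAPE) on predicted active cycles. As a minimal sketch of how such a metric is commonly computed, assuming hypothetical per-benchmark cycle counts (the function name and the numbers are illustrative and do not come from the paper):

```python
def mape(predicted, measured):
    """Mean absolute percentage error, in percent, over paired samples."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the per-benchmark relative error against the hardware measurement.
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Illustrative values only: active cycles measured on hardware vs. simulated.
measured_cycles = [1000, 2000, 4000]
simulated_cycles = [1100, 1800, 4000]
print(f"MAPE: {mape(simulated_cycles, measured_cycles):.2f}%")
```

A lower value means the simulator's predicted active cycles track the hardware measurements more closely across the benchmark set.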