A Lightweight and Flexible Tool for Distinguishing Between Hardware Malfunctions and Program Bugs in Debugging Large-Scale Programs

Guozhen Zhang, Yi Liu, Hailong Yang, Depei Qian
DOI: https://doi.org/10.1109/access.2018.2882394
IF: 3.9
2018-01-01
IEEE Access
Abstract: In this paper, we propose a new technique for determining whether a program failure is caused by a hardware malfunction or by a program bug. The technique mitigates the impact that the shorter mean time between failures (MTBF) expected on future exascale supercomputers has on the debugging process, and it improves the productivity of debugging large-scale parallel programs. It detects program failures by observing abnormal message-passing behavior with distributed monitors and leverages an event-driven mechanism to trigger global status checking among different node groups concurrently. In addition, it provides both coarse-grained execution snapshots and fine-grained failure events for further failure diagnosis and bug analysis. We implement this technique as a user-space library named Failure Cause Resolver (FCR). Experimental results on the Tianhe-2 supercomputer demonstrate that FCR's failure-detection latency is acceptable and its overhead is negligible. FCR does not require administrative privileges and can be easily integrated into existing large-scale parallel programs.
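The detection idea in the abstract can be illustrated with a minimal Python sketch. This is not FCR's implementation or API. The names `Monitor`, `check_group`, and `node_alive`, the timeout threshold, and the classification rule (a silent rank on an unreachable node suggests hardware, a silent rank on a reachable node suggests a program bug) are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Cause(Enum):
    HARDWARE = "suspected hardware malfunction"
    PROGRAM = "suspected program bug"


@dataclass
class Monitor:
    """Hypothetical per-group monitor (illustrative only, not FCR's API)."""
    timeout: float  # assumed silence threshold, in seconds
    last_seen: dict = field(default_factory=dict)  # rank -> time of last message

    def record_message(self, rank, now):
        # Called whenever a message from `rank` is observed.
        self.last_seen[rank] = now

    def silent_ranks(self, now):
        # Ranks whose last observed message is older than the timeout.
        return sorted(r for r, t in self.last_seen.items() if now - t > self.timeout)


def check_group(monitor, now, node_alive):
    """Event-driven status check: runs the classification only if some rank is silent.

    `node_alive` is a hypothetical health probe (rank -> bool).
    Assumed rule: unreachable node -> hardware, reachable node -> program bug.
    """
    silent = monitor.silent_ranks(now)
    if not silent:
        return {}
    return {r: (Cause.PROGRAM if node_alive(r) else Cause.HARDWARE) for r in silent}


# Usage: ranks 1 and 2 fall silent; only the node hosting rank 2 is unreachable.
m = Monitor(timeout=5.0)
for rank in (0, 1, 2):
    m.record_message(rank, now=0.0)
m.record_message(0, now=8.0)
result = check_group(m, now=10.0, node_alive=lambda r: r != 2)
```

In this sketch, `result` maps rank 1 to `Cause.PROGRAM` and rank 2 to `Cause.HARDWARE`. A real deployment would run one such monitor per node group and check the groups concurrently, as the abstract describes.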