Re-Running Large-Scale Parallel Programs Using Two Nodes.
Yayu Guo, Fang Lin, Yi Liu, Depei Qian
DOI: https://doi.org/10.1109/bdcloud.2018.00079
2018-01-01
Abstract: As high performance computing (HPC) systems grow in scale and complexity, programming, debugging, and tuning large-scale parallel programs face a series of challenges. One is that programmers often need to run their programs repeatedly with a large number of processes on HPC systems to identify sources of errors and performance bottlenecks, which consumes large amounts of resources. Furthermore, since most HPC systems use a job scheduling system to manage resources and schedule multiple jobs from different users, programmers cannot interact with their programs during execution, which further complicates debugging and tuning. To address this challenge, this paper proposes a system that re-runs large-scale MPI parallel programs using two nodes. Following an approach of one real-execution plus multiple emulation-executions, the parallel program is first executed with the desired number of processes on an HPC system, which is referred to as the real-execution; during this execution, the system records the MPI messages transmitted among processes as well as the control information of the processes. After that, one or more processes can be re-run on a two-node local system at the same scale as the real-execution. Meanwhile, programmers can interact with their programs by attaching GDB, a commonly used debugger, to the re-running process. The system therefore not only significantly reduces resource consumption in debugging and tuning large-scale parallel programs, but also supports interaction between developers and their programs during execution, making it easier for programmers to identify sources of errors and performance bottlenecks in their parallel programs.
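The record-then-re-run idea in the abstract can be illustrated with a toy sketch. It is not the authors' system and uses no real MPI: `RecordingComm`, `ReplayComm`, `real_execution`, and `emulation_execution` are hypothetical names, the "processes" are simulated in one Python process, and the workload (a ring exchange) is an assumption chosen only for illustration. The sketch logs every message a rank receives during the full-scale run, then re-runs a single rank by feeding its receives from that log.

```python
from collections import defaultdict, deque


class RecordingComm:
    """Toy in-memory communicator for the real-execution (hypothetical, not MPI).

    Delivers messages between simulated ranks and logs each receive.
    """

    def __init__(self, size):
        self.size = size
        self.queues = defaultdict(deque)  # (src, dst) -> pending payloads
        self.log = defaultdict(list)      # dst -> [(src, payload), ...] in receive order

    def send(self, src, dst, payload):
        self.queues[(src, dst)].append(payload)

    def recv(self, dst, src):
        payload = self.queues[(src, dst)].popleft()
        self.log[dst].append((src, payload))
        return payload


class ReplayComm:
    """Toy communicator for an emulation-execution of a single rank."""

    def __init__(self, size, recorded):
        self.size = size                 # same scale as the real-execution
        self.pending = deque(recorded)   # this rank's recorded receives

    def send(self, src, dst, payload):
        pass  # no peer is running; outgoing messages are dropped in this sketch

    def recv(self, dst, src):
        logged_src, payload = self.pending.popleft()
        if logged_src != src:
            raise RuntimeError("replay diverged from the recorded run")
        return payload


def send_step(comm, rank, value):
    # Each rank sends its current value to its right-hand neighbour in the ring.
    comm.send(rank, (rank + 1) % comm.size, value)


def recv_step(comm, rank, value):
    # Each rank adds the value received from its left-hand neighbour.
    return value + comm.recv(rank, (rank - 1) % comm.size)


def real_execution(size, rounds):
    """Run all simulated ranks in lockstep and return final values plus the log."""
    comm = RecordingComm(size)
    values = [rank + 1 for rank in range(size)]
    for _ in range(rounds):
        for rank in range(size):
            send_step(comm, rank, values[rank])
        values = [recv_step(comm, rank, values[rank]) for rank in range(size)]
    return values, comm.log


def emulation_execution(size, rounds, rank, recorded):
    """Re-run one rank alone, satisfying its receives from the recorded log."""
    comm = ReplayComm(size, recorded)
    value = rank + 1
    for _ in range(rounds):
        send_step(comm, rank, value)
        value = recv_step(comm, rank, value)
    return value


if __name__ == "__main__":
    final_values, log = real_execution(size=4, rounds=2)
    replayed = emulation_execution(size=4, rounds=2, rank=2, recorded=log[2])
    print(final_values[2] == replayed)
```

In this sketch the re-run rank executes the same code path as in the full run, which is the property that would let a debugger be attached to it; how the actual system intercepts MPI calls and stores control information is not specified in the abstract and is not modelled here.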