Spatially Bursty I/O on Supercomputers: Causes, Impacts and Solutions.

Jie Yu, Wenxiang Yang, Fang Wang, Dezun Dong, Jinghua Feng, Yuqi Li
DOI: https://doi.org/10.1109/tpds.2020.3005572
IF: 5.3
2020-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Understanding the I/O characteristics of supercomputers is crucial for accurately characterizing I/O workloads and uncovering potential I/O inefficiencies. We collect and analyze I/O traces from two production supercomputers and find that I/O traffic peaks not only occur in short periods of time but also originate from a minority of adjacent compute nodes, a pattern we call spatially bursty I/O. Since modern supercomputers widely adopt an I/O forwarding architecture, in which an I/O node performs I/O on behalf of a subset of nearby compute nodes, spatially bursty I/O causes significant load imbalance and underutilization on the I/O nodes. To address these problems, we quantitatively analyze the two causes of spatially bursty I/O: the uneven distribution of I/O across a job's processes and the uneven distribution of a job's nodes across the system. We propose two solutions that mobilize more I/O nodes to participate in a job's I/O activity. (1) We change the I/O node mapping so that adjacent compute nodes use different I/O nodes instead of the same one. (2) Based on a job's I/O characteristics extracted from historical I/O traces, we distribute the compute nodes of data-intensive jobs more sparsely to utilize more I/O nodes. Extensive evaluations of both solutions show that they can further exploit the potential of the I/O forwarding layer. We have deployed the proposed I/O node mapping on a production supercomputer for 11 months. Our experience shows that it effectively improves I/O performance, balances load, and alleviates I/O interference.
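The effect of the first solution can be illustrated with a small sketch. The abstract does not specify the mapping function the authors deployed, so the two functions below (`block_mapping` and `interleaved_mapping`), the system size, and the job placement are all hypothetical choices made only to show why sending adjacent compute nodes to different I/O nodes spreads a spatially bursty job across more of the forwarding layer.

```python
from collections import Counter

# Hypothetical system size, chosen only for illustration.
NUM_COMPUTE_NODES = 64
NUM_IO_NODES = 8
NODES_PER_ION = NUM_COMPUTE_NODES // NUM_IO_NODES


def block_mapping(node_id):
    """Conventional forwarding: a run of adjacent compute nodes shares one I/O node."""
    return node_id // NODES_PER_ION


def interleaved_mapping(node_id):
    """Assumed alternative: adjacent compute nodes are sent to different I/O nodes."""
    return node_id % NUM_IO_NODES


def ion_load(job_nodes, mapping):
    """Count how many of a job's compute nodes each I/O node has to serve."""
    return Counter(mapping(n) for n in job_nodes)


# A job placed on 8 adjacent compute nodes (a spatially bursty source of I/O).
job_nodes = range(16, 24)

block_load = ion_load(job_nodes, block_mapping)
interleaved_load = ion_load(job_nodes, interleaved_mapping)

print("block mapping:      ", dict(block_load))
print("interleaved mapping:", dict(interleaved_load))
```

Under these assumptions, the block mapping funnels all eight nodes of the job through a single I/O node, while the interleaved mapping gives each of the eight I/O nodes one compute node to serve.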