Practice Guideline for Heavy I/O Workloads with Lustre File Systems on TACC Supercomputers.

Si Liu, Lei Huang, Hang Liu, Amit Ruhela, Virginia Trueheart, Susan Lindsey, Quan Yuan
DOI: https://doi.org/10.1145/3437359.3465570
2021-01-01
Abstract: While the computational power of modern supercomputers has risen tremendously in recent years, users’ I/O work (read, write, open, close, etc.) has increased correspondingly. As a result of this high I/O load, users’ jobs often overwhelm the supercomputers’ file systems. Generating hundreds of GB of data or issuing tens of thousands of I/O operations in a very short period of time significantly slows down file systems and in some cases may result in a crash, incurring the loss of users’ compute time, a great number of user service tickets, and a perception of poor reliability. Nearly a decade of close observation and study of file systems has led us to formulate new guidelines and invent several tools to alleviate the I/O issues in the current supercomputing environment. In this paper, we focus on I/O work implemented on the Lustre parallel file systems of Frontera and Stampede2, but also investigate other types of file systems employed on other TACC machines. We also discuss common I/O issues collected from supercomputer users, including a high frequency of MDS requests, overloaded OSS, and unstriped large files. To solve these problems, we offer important guidelines on how to choose appropriate file systems. Furthermore, we introduce novel tools and workflows, such as CDTools, Python_Cacher, OOOPS, and Auto Striping, to facilitate users’ I/O work. We believe these tools will greatly benefit users who need to manage heavy I/O workloads on parallel file systems.