Parallel I/O Characterization and Optimization on Large-Scale HPC Systems: A 360-Degree Survey

Hammad Ather, Jean Luca Bez, Chen Wang, Hank Childs, Allen D. Malony, Suren Byna
2024-12-31
Abstract: Driven by artificial intelligence, data science, and high-resolution simulations, I/O workloads and hardware on high-performance computing (HPC) systems have become increasingly complex. This complexity can lead to large I/O overheads and overall performance degradation. These inefficiencies are often mitigated using tools and techniques for characterizing, analyzing, and optimizing the I/O behavior of HPC applications. That said, the myriad number of tools and techniques available makes it challenging to navigate to the best approach. In response, this paper surveys 131 papers from the ACM Digital Library, IEEE Xplore, and other reputable journals to provide a comprehensive analysis, synthesized in the form of a taxonomy, of the current landscape of parallel I/O characterization, analysis, and optimization of large-scale HPC systems. We anticipate that this taxonomy will serve as a valuable resource for enhancing I/O performance of HPC applications.
Distributed, Parallel, and Cluster Computing; Performance
What problem does this paper attempt to address?
This paper addresses parallel input/output (I/O) performance optimization on high-performance computing (HPC) systems. Specifically, it examines three problems:

1. **Performance degradation due to increased complexity**:
   - Driven by artificial intelligence, data science, and high-resolution simulations, the I/O workloads and hardware of HPC systems are becoming increasingly complex.
   - This complexity can lead to large I/O overheads and overall performance degradation.
2. **Difficulty in selecting tools and techniques**:
   - The large number of tools and techniques available for characterizing, analyzing, and optimizing I/O behavior makes it difficult for users to choose the most suitable approach.
   - Users need a comprehensive resource that helps them understand these tools and techniques and make informed choices.
3. **Lack of systematic knowledge integration**:
   - The HPC community lacks a comprehensive, systematic resource on how to evaluate and optimize parallel I/O performance effectively.
   - The existing literature is scattered across multiple databases and journals, with no unified framework to integrate it.

### Specific objectives of the paper

To address these problems, the paper surveys 131 related papers and provides a comprehensive review that aims to:

- **Construct a classification system**: Through a comprehensive analysis of the existing literature, construct a taxonomy of parallel I/O characterization, analysis, and optimization that helps users understand and choose appropriate tools and techniques.
- **Provide detailed comparison and evaluation**: Describe the different levels of the HPC I/O stack in detail and analyze the impact of various I/O access patterns on performance.
- **Guide practical applications**: Give end users of HPC systems clear guidance so that they can make more efficient and informed decisions when evaluating and optimizing parallel I/O performance.

### Main content

The paper covers the following main topics:

1. **Introduction to the HPC I/O stack**:
   - Describes the different levels of the HPC I/O stack, including high-level I/O libraries, parallel I/O middleware, low-level I/O libraries, I/O forwarding layers, and parallel file systems (PFS).
   - Analyzes the function of each layer and its impact on performance.
2. **Classification system for parallel I/O evaluation and optimization**:
   - Presents the classification system for parallel I/O evaluation and optimization as a node-link hierarchical tree diagram.
   - Covers key stages such as workload generation, data monitoring and collection, performance analysis, and optimization.
3. **Specific evaluation methods and tools**:
   - Introduces multiple benchmarking tools for evaluating parallel I/O performance, such as IOR, MDTest, fio, Elbencho, and IOzone.
   - Compares the characteristics, application scenarios, and limitations of these tools in detail.
4. **Application-level evaluation and optimization**:
   - Explores how to generate I/O workloads through simulation frameworks, proxy applications, and similar approaches.
   - Analyzes different application-level benchmarks, such as h5bench, DLIO, HACC-IO, and FLASH I/O.

Taken together, this material gives the HPC community a resource for approaching complex parallel I/O performance optimization problems more systematically and efficiently.
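To make the idea of an I/O benchmark mentioned above concrete, the following is a minimal, single-process sketch in Python of what a synthetic write benchmark measures: a fixed number of fixed-size blocks written sequentially, timed, and reported as bandwidth. It is not taken from the paper and does not reproduce IOR, fio, or any other tool named in the survey; the function name `measure_write_bandwidth` and its parameters are hypothetical, and real parallel benchmarks additionally coordinate many processes and target a parallel file system.

```python
import os
import tempfile
import time
from typing import Optional


def measure_write_bandwidth(block_size: int, num_blocks: int,
                            directory: Optional[str] = None) -> dict:
    """Hypothetical helper: time a sequential write of num_blocks blocks.

    Assumption: a single process writing to a temporary file is enough to
    illustrate the measurement; it says nothing about parallel performance.
    """
    payload = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(num_blocks):
                f.write(payload)
            # Flush Python's buffer and ask the OS to persist the data, so
            # the timing is not purely a measurement of in-memory caching.
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
        bytes_written = os.path.getsize(path)
    finally:
        os.remove(path)

    # Guard against a zero interval on very coarse timers.
    elapsed = max(elapsed, 1e-9)
    return {
        "bytes_written": bytes_written,
        "seconds": elapsed,
        "mib_per_s": bytes_written / (1024 * 1024) / elapsed,
    }


if __name__ == "__main__":
    result = measure_write_bandwidth(block_size=1024 * 1024, num_blocks=64)
    print(f"{result['bytes_written']} bytes in {result['seconds']:.4f} s "
          f"({result['mib_per_s']:.1f} MiB/s)")
```

Varying `block_size` while holding the total volume fixed is one simple way to see how request size affects measured bandwidth, which is the kind of access-pattern effect the survey discusses.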