Tarazu: An Adaptive End-to-End I/O Load Balancing Framework for Large-Scale Parallel File Systems

Arnab K. Paul, Sarah Neuwirth, Bharti Wadhwa, Feiyi Wang, Sarp Oral, Ali R. Butt
DOI: https://doi.org/10.1145/3641885
2024-02-01
ACM Transactions on Storage
Abstract: The imbalanced I/O load on large parallel file systems affects the parallel I/O performance of high-performance computing (HPC) applications. One of the main reasons for I/O imbalances is the lack of a global view of system-wide resource consumption. While approaches to address the problem already exist, the diversity of HPC workloads combined with different file striping patterns prevents widespread adoption of these approaches. In addition, load balancing techniques should be transparent to client applications. To address these issues, we propose Tarazu, an end-to-end control plane where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages real-time load statistics for global data placement on distributed storage servers, while our design model employs trace-based optimization techniques to minimize latency for I/O load requests between clients and servers and to handle multiple striping patterns in files. We evaluate our proposed system on an experimental cluster for two common use cases: the synthetic I/O benchmark IOR and the scientific application I/O kernel HACC-I/O. We also use a discrete-time simulator with real HPC application traces from emerging workloads running on the Summit supercomputer to validate the effectiveness and scalability of Tarazu in large-scale storage environments. The results show improvements in load balancing and read performance of up to 33% and 43%, respectively, compared to the state of the art.
computer science, software engineering, hardware & architecture
What problem does this paper attempt to address?
The paper addresses I/O load imbalance in large-scale parallel file systems. Specifically:

1. **I/O Load Imbalance**: In high-performance computing (HPC) applications, an uneven distribution of I/O load degrades parallel I/O performance. A main cause is the lack of a global view of system-wide resource consumption.
2. **Limitations of Existing Methods**:
   - **Diversity of Workloads**: Existing solutions struggle to adapt to diverse HPC workloads and different file striping patterns.
   - **Transparency**: Load balancing techniques should be transparent to client applications.
   - **Centralized Prediction Algorithms**: Existing centralized prediction algorithms limit the scalability of the load balancing framework.
   - **File Layout**: Inefficient file striping patterns can lead to uneven utilization of storage components, and can even cause some storage targets to fill up completely.
3. **Objective**: Propose an end-to-end control plane (Tarazu) that optimizes data placement for applications with different I/O request sizes through intelligent and adaptive placement algorithms that take the current load of the file system into account. Tarazu aims to handle various file striping patterns, support scientific code development, and make efficient use of large-scale parallel I/O and storage resources.

The paper addresses these issues by designing and implementing Tarazu. It evaluates the system on an experimental cluster and with a discrete-time simulator driven by application traces from the Summit supercomputer, reporting improvements in load balancing and read performance of up to 33% and 43%, respectively, compared to the state of the art.
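
The idea of placing data according to current load can be illustrated with a minimal Python sketch. It is not Tarazu's algorithm, which the abstract does not specify. It assumes a simple greedy policy that writes each file's stripes to the least-loaded storage targets. The names `StorageTarget`, `place_stripes`, and `imbalance` are hypothetical and chosen only for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class StorageTarget:
    """Hypothetical storage target with a load statistic visible to the placer."""
    name: str
    used_bytes: int


def place_stripes(targets, stripe_count, stripe_bytes):
    """Greedy placement (an assumed policy, not the paper's method).

    Picks the `stripe_count` least-loaded targets, charges each of them
    `stripe_bytes`, and returns their names in ascending order of prior load.
    """
    if stripe_count > len(targets):
        raise ValueError("stripe_count exceeds the number of targets")
    chosen = heapq.nsmallest(stripe_count, targets, key=lambda t: t.used_bytes)
    for target in chosen:
        target.used_bytes += stripe_bytes
    return [target.name for target in chosen]


def imbalance(targets):
    """Gap between the most and least loaded targets, in bytes."""
    loads = [t.used_bytes for t in targets]
    return max(loads) - min(loads)


if __name__ == "__main__":
    targets = [
        StorageTarget("ost0", 100),
        StorageTarget("ost1", 10),
        StorageTarget("ost2", 50),
        StorageTarget("ost3", 20),
    ]
    print("imbalance before:", imbalance(targets))
    print("placed on:", place_stripes(targets, stripe_count=2, stripe_bytes=30))
    print("imbalance after:", imbalance(targets))
```

A real control plane would also have to gather load statistics from the servers, cope with stale values, and honor the file's striping pattern. The sketch only shows why a global view of per-target load allows a placement that narrows the gap between targets.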