Design and Development of a Java Parallel I/O Library

Muhammad Sohaib Ayub, Muhammad Adnan, Muhammad Yasir Shafi
DOI: https://doi.org/10.48550/arXiv.2305.07414
2023-05-12
Abstract: Parallel I/O refers to the ability of scientific programs to concurrently read from and write to a single file from multiple processes executing on distributed-memory platforms such as compute clusters. In the HPC world, I/O becomes a significant bottleneck for many real-world scientific applications. Over the last two decades, there has been significant research on improving the performance of I/O operations in scientific computing for traditional languages including C, C++, and Fortran. As a result, several mature, high-performance libraries, including ROMIO (an implementation of MPI-IO), parallel HDF5, Parallel I/O (PIO), and parallel netCDF, are available today and provide efficient I/O for scientific applications. However, very little research has been done to evaluate and improve the I/O performance of Java-based HPC applications. The main hindrance to the development of efficient parallel I/O libraries for Java is the lack of a standard API (something equivalent to MPI-IO). Some ad hoc solutions have been developed and used in proprietary applications, but there is no general-purpose solution that can be used by performance-hungry applications. As part of this project, we plan to develop a Java-based parallel I/O API inspired by the MPI-IO bindings (MPI 2.0 standard document) for C, C++, and Fortran. Once the Java equivalent of the MPI-IO API has been developed, we will develop a reference implementation on top of existing Java messaging libraries. Later, we will evaluate and compare the performance of our reference Java parallel I/O library with its C/C++ counterparts using benchmarks and real-world applications.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to address the issue of insufficient parallel I/O performance of Java in High Performance Computing (HPC). Specifically, the paper focuses on the following points:

1. **Lack of Standard API**: Currently, Java lacks a standard API for parallel I/O similar to MPI-IO in C, C++, and Fortran. This leads to difficulties in developing efficient parallel I/O libraries.
2. **Performance Bottleneck**: In HPC applications, disk I/O remains a significant bottleneck. Although computational speed has greatly increased, the speed of parallel I/O has not kept up, resulting in a large performance gap.
3. **Limitations of Existing Solutions**: While there are some ad-hoc solutions for specific applications, these solutions are not general and cannot meet the needs of high-performance applications.

### Solution

To address the above issues, the paper plans to develop a Java-based parallel I/O API inspired by the MPI-IO specification and aims to achieve the following goals:

1. **Design API**: Develop a Java parallel I/O API compatible with MPI-IO to provide efficient parallel read and write operations.
2. **Reference Implementation**: Implement a reference version of the parallel I/O library based on existing Java message-passing libraries.
3. **Performance Evaluation**: Evaluate and compare the performance of the Java parallel I/O library with C/C++ libraries through benchmarks and real-world applications.

### Main Contributions

- **Standardized API**: Provide a standardized Java parallel I/O API, filling the gap in existing research.
- **Efficient Implementation**: Improve the parallel I/O performance of Java in HPC applications through optimized implementation.
- **Performance Comparison**: Validate the effectiveness and advantages of the Java parallel I/O library through detailed performance testing.

Through these efforts, the paper hopes to promote the application of Java in the field of high-performance computing, particularly in the development of parallel I/O.
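
To make the idea of several workers writing to one shared file concrete, here is a minimal Java sketch. It is not the API proposed in the paper. The `SharedFile` interface and its `writeAt`/`readAt` methods are hypothetical names, loosely modeled on the explicit-offset style of MPI-IO's `MPI_File_write_at` and `MPI_File_read_at`. Threads stand in for MPI processes, and the standard `java.nio.channels.FileChannel` positional read and write calls do the actual I/O. Each simulated rank writes its own disjoint block, so no locking is needed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParallelFileSketch {

    // Hypothetical interface (an assumption for illustration, not the paper's API):
    // explicit-offset access in the spirit of MPI_File_write_at / MPI_File_read_at.
    interface SharedFile extends AutoCloseable {
        void writeAt(long offset, ByteBuffer src) throws IOException;
        void readAt(long offset, ByteBuffer dst) throws IOException;
        @Override
        void close() throws IOException;
    }

    // Backed by FileChannel, whose positional reads/writes are safe for concurrent use.
    static final class ChannelSharedFile implements SharedFile {
        private final FileChannel channel;

        ChannelSharedFile(Path path) throws IOException {
            channel = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
        }

        @Override
        public void writeAt(long offset, ByteBuffer src) throws IOException {
            long pos = offset;
            while (src.hasRemaining()) {
                pos += channel.write(src, pos); // does not move the channel position
            }
        }

        @Override
        public void readAt(long offset, ByteBuffer dst) throws IOException {
            long pos = offset;
            while (dst.hasRemaining()) {
                int n = channel.read(dst, pos);
                if (n < 0) break; // end of file
                pos += n;
            }
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }

    // Each simulated rank fills one block with its rank id and writes it at
    // offset rank * blockSize; the whole file is then read back and returned.
    static byte[] run(int ranks, int blockSize) {
        try {
            Path path = Files.createTempFile("pario-sketch", ".bin");
            try (SharedFile file = new ChannelSharedFile(path)) {
                List<Thread> workers = new ArrayList<>();
                for (int r = 0; r < ranks; r++) {
                    final int rank = r;
                    Thread t = new Thread(() -> {
                        byte[] block = new byte[blockSize];
                        Arrays.fill(block, (byte) rank);
                        try {
                            file.writeAt((long) rank * blockSize, ByteBuffer.wrap(block));
                        } catch (IOException e) {
                            // Sketch only: a real library would report this to the caller.
                            throw new UncheckedIOException(e);
                        }
                    });
                    workers.add(t);
                    t.start();
                }
                for (Thread t : workers) {
                    t.join(); // wait for every simulated rank to finish writing
                }
                ByteBuffer all = ByteBuffer.allocate(ranks * blockSize);
                file.readAt(0, all);
                return all.array();
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = run(4, 8);
        System.out.println(Arrays.toString(data));
    }
}
```

On a real cluster the workers would be separate processes coordinated by a Java message-passing library, and the file would typically sit on a parallel file system. The sketch covers only the explicit-offset access pattern. It leaves out collective operations, file views, and the other features that MPI-IO defines.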