Abstract: Numerical weather forecasting is one of the most effective means of reducing the impact of unexpected weather events. As demands on prediction precision and time-to-solution grow, high performance computing technologies have improved considerably. However, I/O becomes a significant performance bottleneck when scaling up to thousands of processes. In this paper we analyze the I/O access patterns of GRAPES (Global/Regional Assimilation and Prediction System), a numerical weather prediction system, as a case of regular multi-dimensional data access. We implement two parallel I/O strategies based on MPI-IO and ADIOS (Adaptive I/O System), making full use of efficient synchronous I/O schemes. For ADIOS, the "MPI_AMR" method is employed to improve the parallel output bandwidth: aggregator processes execute the I/O operations, and each aggregator writes to one subfile on one OST, which reduces I/O conflicts. Experiments show that both optimizations outperform the original sequential I/O access, with large improvements on the Tianhe-1A and Sunway BlueLight systems in China. With ADIOS, I/O accounts for no more than 9% of total time when scaling up to 2K processes on Tianhe-1A, whereas sequential I/O costs more than 50% of total time at 1K or more processes. The aggregated output based on ADIOS achieves better output performance improvements, peaking at 3.84 GB/s with one time-step output on Tianhe-1A. In contrast, MPI-IO obtains good input performance improvements, peaking at 4.55 GB/s.

We use GRAPES's I/O component as a benchmark to study I/O performance with ADIOS further. From the rules found, we can design an efficient scheme for using "MPI_AMR" with ADIOS on Tianhe-1A. Take 15-km horizontal resolution as an example. Since no more than 80 OSTs were available for our tests on Tianhe-1A, 32 or 64 OSTs are chosen to facilitate parallel I/O, and the number of aggregators should then be set to 64 or 128. The optimal data size per OST on Tianhe-1A, 114 MB, can be determined with simple test cases. If we use 32 OSTs with 1024 processes, it follows that aggregating 4 time steps gives optimal I/O performance for that number of OSTs. The same holds when 64 OSTs are used. Hence, time-step aggregation is useful for output optimization based on "MPI_AMR", whose peak reaches 7.69 GB/s on 2K processes with 64 OSTs and 128 aggregators when 8 time-step aggregation is used.

We also examine the performance effects of data layout in the Lustre file system based on MPI-IO. The results imply that distributing data over more OSTs outperforms using a limited number of OSTs, while I/O performance is more likely to be disturbed when data is distributed over nearly all of the OSTs. This influence is more apparent with MPI-IO than with ADIOS. (C) 2014 Elsevier B.V. All rights reserved.
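
The sizing reasoning in the abstract (pick a number of OSTs, then aggregate enough time steps that each OST receives roughly its optimal data size per write) can be sketched as a small calculation. This is only an illustration under stated assumptions, not the authors' procedure. The abstract does not give the output size of one time step; the 912 MB used below is an assumption chosen because, together with the quoted 114 MB per OST, it reproduces the quoted 4-step (32 OSTs) and 8-step (64 OSTs) aggregation. The ratio of two aggregators per OST is likewise inferred from the quoted pairs (32 OSTs with 64 aggregators, 64 OSTs with 128). All function and parameter names are hypothetical.

```python
# Illustrative sizing helper; names and the per-step output size are assumptions.

OPTIMAL_MB_PER_OST = 114    # quoted in the abstract for Tianhe-1A
AGGREGATORS_PER_OST = 2     # inferred from the quoted 32/64 and 64/128 pairs
ASSUMED_STEP_OUTPUT_MB = 912  # NOT from the abstract; chosen to match its results


def aggregation_steps(n_osts, step_output_mb, optimal_mb_per_ost=OPTIMAL_MB_PER_OST):
    """Time steps to buffer so one write puts ~optimal_mb_per_ost on each OST."""
    if n_osts <= 0 or step_output_mb <= 0 or optimal_mb_per_ost <= 0:
        raise ValueError("all sizes and counts must be positive")
    target_mb = n_osts * optimal_mb_per_ost  # total data per aggregated write
    return max(1, round(target_mb / step_output_mb))


def aggregator_count(n_osts, per_ost=AGGREGATORS_PER_OST):
    """Aggregator processes for a given number of OSTs (inferred ratio)."""
    if n_osts <= 0 or per_ost <= 0:
        raise ValueError("counts must be positive")
    return n_osts * per_ost


if __name__ == "__main__":
    for osts in (32, 64):
        steps = aggregation_steps(osts, ASSUMED_STEP_OUTPUT_MB)
        print(osts, "OSTs:", aggregator_count(osts), "aggregators,",
              steps, "time steps per write")
```

With a different per-step output size (for example, another horizontal resolution), the same helper would give a different aggregation depth, so the 912 MB value should be replaced by a measured one before drawing any conclusions.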

A Case Study of Large-Scale Parallel I/O Analysis and Optimization for Numerical Weather Prediction System.

A Numerical Model Oriented Large-scale Parallel I/O Optimization Method.

Parallel I/O Optimization for High Resolution Ocean Model LICOM2

Effectively Mitigating I/O Inactivity in vCPU Scheduling

Accelerating WRF I/O Performance with ADIOS2 and Network-Based Streaming

Parallel Optimization for Large-Scale Ocean Data Assimilation

High Performance Parallel I/O and In-Situ Analysis in the WRF Model with ADIOS2

An End-to-end and Adaptive I/O Optimization Tool for Modern HPC Storage Systems

Development and performance optimization of a parallel computing infrastructure for an unstructured-mesh modelling framework

Parallel Contributing Area Calculation with Granularity Control on Massive Grid Terrain Datasets

A Two-Level Parallel Decomposition Approach for Transient Stability Constrained Optimal Power Flow

Optimized Data I/O Strategy of the Algorithm of Parallel Digital Terrain Analysis

I/O Bottleneck Detection and Tuning: Connecting the Dots using Interactive Log Analysis

A cost-aware region-level data placement scheme for hybrid parallel I/O systems

Parallelize Over Data Particle Advection: Participation, Ping Pong Particles, and Overhead

Streaming Data in HPC Workflows Using ADIOS

An MPI+OpenACC-based PRM Scalar Advection Scheme in the GRAPES Model over a Cluster with Multiple CPUs and GPUs

Output Performance Study on a Production Petascale Filesystem.

Optimizing Parallel I/O Accesses Through Pattern-Directed and Layout-Aware Replication

A Dynamic Data Partition Algorithm Oriented to MPI and OpenMP

Improving Parallel Performance of a Finite-Difference AGCM on Modern High-Performance Computers