A Data Optimizer for Region-Aware Self-describing Files in Scientific Computing

Yanjie Song,Tianyuan Wu,Yuanhao Li,Guancheng Li,Yuchen Liu,Shu Yin,Wei Xue,Junchao Wang
DOI: https://doi.org/10.1145/3698038.3698526
2024-01-01
Abstract:Acquiring data from scientific simulations for analytical purposes is inherently challenging due to the complex and irregularly shaped regions within which the data resides, particularly when using self-describing data formats. The process of region-based data distillation becomes even more arduous when employing persistent memory or parallel file systems. To tackle this challenge, we introduce RASTER (Region-Aware Self-describing daTa optimizER), a lightweight middleware designed for region-aware data preprocessing. RASTER dynamically reorganizes data into variable groups based on regional identifiers during runtime, thereby eliminating the need for sequential searches to locate the required data. We have developed a prototype of RASTER and successfully integrated it into three distinct computing environments: a single-node server equipped with Intel® Optane™ DC persistent memory, the Huawei® OceanStor cloud storage platform, and the Sunway TaihuLight supercomputer. We then conducted a thorough evaluation of the RASTER prototype on the latter two platforms using a real-world scientific application, CESM (Community Earth System Model). Our experimental results demonstrate that RASTER enhances data acquisition performance by up to 2.83× and achieves a 2.36× speedup over conventional netCDF and the state-of-the-art ADIOS2. Additionally, RASTER significantly reduces memory usage by up to 400%, showcasing its scalability potential.
What problem does this paper attempt to address?