Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL

Justus Henneberg, Felix Schuhknecht, Philipp Reutter, Nils Brast, Peter Spichtinger
DOI: https://doi.org/10.48550/arXiv.2109.08053
2021-09-16
Abstract: Performing data-intensive analytics is an essential part of modern Earth science. As such, research in atmospheric physics and meteorology frequently requires the processing of very large observational and/or modeled datasets. Typically, these datasets (a) have high dimensionality, i.e. contain various measurements per spatiotemporal point, (b) are extremely large, containing observations over a long time period. Additionally, (c) the analytical tasks being performed on these datasets are structurally complex. Over the years, the binary format NetCDF has been established as a de-facto standard in distributing and exchanging such multi-dimensional datasets in the Earth science community -- along with tools and APIs to visualize, process, and generate them. Unfortunately, these access methods typically lack either (1) an easy-to-use but rich query interface or (2) an automatic optimization pipeline tailored towards the specialities of these datasets. As such, researchers from the field of Earth sciences (which are typically not computer scientists) unnecessarily struggle in efficiently working with these datasets on a daily basis. Consequently, in this work, we aim at resolving the aforementioned issues. Instead of proposing yet another specialized tool and interface to work with atmospheric datasets, we integrate sophisticated NetCDF processing capabilities into the established SparkSQL dataflow engine -- resulting in our system Northlight. In contrast to comparable systems, Northlight introduces a set of fully automatic optimizations specifically tailored towards NetCDF processing. We experimentally show that Northlight scales gracefully with the selectivity of the analysis tasks and outperforms the comparable state-of-the-art pipeline by up to a factor of 6x.
Databases; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses several problems that arise when processing atmospheric datasets in Earth science research:

1. **Processing of high-dimensional, large-scale datasets**: Modern Earth science research often needs to process large multi-dimensional datasets from observations or simulations. These datasets cover long time periods and contain multiple measurements at each spatiotemporal point, which makes them both high-dimensional and extremely large.
2. **Structurally complex analysis tasks**: The analysis tasks performed on these datasets are usually complex in structure, which makes the data harder to process.
3. **Limitations of existing tools**: NetCDF is the de-facto standard format in the Earth science community for distributing and exchanging multi-dimensional datasets. Many tools and APIs exist for visualizing, processing, and generating such datasets, but they typically lack either an easy-to-use, feature-rich query interface or an optimization pipeline that is automatically tailored to the characteristics of these datasets.
4. **Needs of researchers who are not computer scientists**: Researchers in Earth science are usually not computer scientists. They struggle to process these datasets efficiently and need a solution that is both convenient and fast.

To this end, the paper proposes the Northlight system, which addresses these problems as follows:

- **Integrating NetCDF processing capabilities into SparkSQL**: Northlight integrates NetCDF processing into SparkSQL, providing an easy-to-use declarative query interface while retaining high-performance data processing.
- **Automatically optimizing queries**: Northlight introduces a set of fully automatic optimizations tailored to NetCDF processing, including vertical and horizontal pruning, optimization of non-convex predicates, and envelope-based optimization of join operations.
- **No need to pre-process data**: Northlight accesses and processes the data sources efficiently without creating any auxiliary index structures in advance.

Through these innovations, Northlight aims to make it more efficient and more convenient for Earth science researchers to process large-scale atmospheric datasets.
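The pruning idea mentioned in the list above can be illustrated with a small, self-contained sketch. It is not Northlight's implementation and does not use the Spark or NetCDF APIs. It assumes that "vertical pruning" means reading only the variables a query references, and that "horizontal pruning" means restricting the read to the coordinate index range that a range predicate selects. The function names `index_range` and `pruned_read`, as well as the dictionary layout of the toy dataset, are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def index_range(axis, lo, hi):
    """Return the half-open index range [start, stop) of the values in a
    sorted coordinate axis that satisfy lo <= value <= hi."""
    return bisect_left(axis, lo), bisect_right(axis, hi)


def pruned_read(dataset, variables, lat_range, lon_range):
    """Read a subset of a gridded dataset.

    Assumed meaning of the two pruning steps:
    - vertical pruning: only the variables listed in `variables` are touched;
    - horizontal pruning: only grid cells inside the latitude/longitude
      index ranges derived from the predicate are visited.
    """
    lat_start, lat_stop = index_range(dataset["lat"], *lat_range)
    lon_start, lon_stop = index_range(dataset["lon"], *lon_range)

    rows = []
    for i in range(lat_start, lat_stop):
        for j in range(lon_start, lon_stop):
            row = {"lat": dataset["lat"][i], "lon": dataset["lon"][j]}
            for name in variables:
                row[name] = dataset["vars"][name][i][j]
            rows.append(row)
    return rows


# Toy stand-in for a NetCDF file: two coordinate axes and two 2-D variables.
dataset = {
    "lat": [40.0, 45.0, 50.0, 55.0],
    "lon": [0.0, 5.0, 10.0],
    "vars": {
        "temperature": [[280 + i + j for j in range(3)] for i in range(4)],
        "humidity": [[0.1 * (i + j) for j in range(3)] for i in range(4)],
    },
}

# Roughly corresponds to a declarative query such as:
#   SELECT lat, lon, temperature FROM data
#   WHERE lat BETWEEN 45 AND 50 AND lon BETWEEN 5 AND 10
rows = pruned_read(dataset, ["temperature"], (45.0, 50.0), (5.0, 10.0))
for row in rows:
    print(row)
```

In this sketch the predicate is translated into index ranges before any data values are read, so cells outside the selected region and the unused `humidity` variable are never visited. That is the intuition for why the cost of such a read can shrink with the selectivity of the query.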