Abstract: Performing data-intensive analytics is an essential part of modern Earth science. As such, research in atmospheric physics and meteorology frequently requires the processing of very large observational and/or modeled datasets. Typically, these datasets (a) have high dimensionality, i.e. contain various measurements per spatiotemporal point, and (b) are extremely large, containing observations over a long time period. Additionally, (c) the analytical tasks being performed on these datasets are structurally complex. Over the years, the binary format NetCDF has been established as a de-facto standard for distributing and exchanging such multi-dimensional datasets in the Earth science community -- along with tools and APIs to visualize, process, and generate them. Unfortunately, these access methods typically lack either (1) an easy-to-use but rich query interface or (2) an automatic optimization pipeline tailored towards the specialities of these datasets. As such, researchers from the field of Earth sciences (who are typically not computer scientists) struggle unnecessarily to work efficiently with these datasets on a daily basis. Consequently, in this work, we aim at resolving the aforementioned issues. Instead of proposing yet another specialized tool and interface to work with atmospheric datasets, we integrate sophisticated NetCDF processing capabilities into the established SparkSQL dataflow engine -- resulting in our system Northlight. In contrast to comparable systems, Northlight introduces a set of fully automatic optimizations specifically tailored towards NetCDF processing. We experimentally show that Northlight scales gracefully with the selectivity of the analysis tasks and outperforms the comparable state-of-the-art pipeline by up to a factor of 6x.

What problem does this paper attempt to address?

This paper attempts to solve several key problems faced in processing atmospheric data sets in earth science research:

1. **Processing of high-dimensional and large-scale data sets**: Modern earth science research often needs to process large multi-dimensional data sets from observations or simulations. These data sets not only contain observational data at a large number of time points, but also have multiple measurements at each spatio-temporal point, resulting in high-dimensional and extremely large-scale data sets.
2. **Structure of complex analysis tasks**: The analysis tasks performed on these data sets are usually complex in structure, which increases the difficulty of data processing.
3. **Limitations of existing tools**: Currently, the NetCDF format is the standard format for the earth science community to distribute and exchange multi-dimensional data sets. Although there are many tools and APIs that can be used for visualizing, processing, and generating these data sets, these tools usually lack easy-to-use and feature-rich query interfaces, or do not have pipelines that are automatically optimized for the characteristics of these data sets.
4. **Needs of researchers who are not computer scientists**: Researchers in the field of earth science are usually not computer scientists. They face difficulties in efficiently processing these data sets and need a convenient and efficient solution.

To this end, the paper proposes the Northlight system, which aims to solve the above problems in the following ways:

- **Integrating NetCDF processing capabilities into SparkSQL**: Northlight integrates NetCDF data processing capabilities into SparkSQL, providing an easy-to-use declarative query interface while maintaining high-performance data processing capabilities.
- **Automatically optimizing queries**: Northlight introduces a fully automated set of optimization strategies, specifically optimized for NetCDF data processing, including vertical and horizontal pruning, non-convex predicate optimization, and optimizing join operations through envelope.
- **No need to pre-process data**: Northlight can achieve efficient access and processing of data sources without creating any auxiliary index structures in advance.

Through these innovations, Northlight aims to improve the efficiency and convenience of earth science researchers in processing large-scale atmospheric data sets.
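To make the idea of vertical and horizontal pruning more concrete, here is a minimal Python sketch. It is not Northlight's implementation: the in-memory `DATASET` layout and the helper names `index_range` and `pruned_scan` are hypothetical, and a single sorted `lat` axis stands in for a NetCDF dimension. The sketch assumes that a range predicate on a sorted coordinate axis can be turned into an index range before any values are read (horizontal pruning), and that only the variables named in the query are touched (vertical pruning).

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory stand-in for a NetCDF file: one sorted coordinate
# axis per dimension and one list of values per variable.
DATASET = {
    "coords": {"lat": [-60.0, -30.0, 0.0, 30.0, 60.0]},
    "variables": {
        "temperature": [250.0, 270.0, 300.0, 285.0, 260.0],
        "pressure": [1010.0, 1012.0, 1009.0, 1015.0, 1020.0],
        "humidity": [0.2, 0.4, 0.8, 0.5, 0.3],
    },
}


def index_range(axis, low, high):
    """Map the closed value range [low, high] on a sorted axis
    to a half-open index range (start, stop)."""
    return bisect_left(axis, low), bisect_right(axis, high)


def pruned_scan(dataset, columns, dim, low, high):
    """Return one row per grid point with low <= dim <= high,
    containing only the requested columns."""
    axis = dataset["coords"][dim]
    # Horizontal pruning: translate the predicate into an index range
    # instead of filtering every point after reading it.
    start, stop = index_range(axis, low, high)
    # Vertical pruning: slice only the variables the query asks for.
    slices = {name: dataset["variables"][name][start:stop] for name in columns}
    return [
        {dim: axis[start + i], **{name: slices[name][i] for name in columns}}
        for i in range(stop - start)
    ]


# Roughly: SELECT lat, temperature FROM data WHERE lat BETWEEN -30 AND 30
rows = pruned_scan(DATASET, ["temperature"], "lat", -30.0, 30.0)
for row in rows:
    print(row)
```

In this toy setting, `pressure` and `humidity` are never sliced and the two grid points outside the latitude range are never visited. The same principle would apply to reading only a sub-block of selected variables from a file rather than loading it whole.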

Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL

Visualization And Diagnosis Of Earth Science Data Through Hadoop And Spark

Distributed Streaming Analytics on Large-scale Oceanographic Data using Apache Spark

Apache Spark Accelerated Deep Learning Inference for Large Scale Satellite Image Analytics

Spatial overlay analysis of land use vector data based on Spark

A New Design of High-Performance Large-Scale GIS Computing at a Finer Spatial Granularity: A Case Study of Spatial Join with Spark for Sustainability

Mining Area Skyline Objects from Map-based Big Data using Apache Spark Framework

LocationSpark

SciAP: A Programmable, High-Performance Platform for Large-Scale Scientific Data

A High Performance Query Analytical Framework for Supporting Data-Intensive Climate Studies

Migrating GIS Big Data Computing from Hadoop to Spark: an Exemplary Study Using Twitter

Sparknet: Training deep networks in spark

Neural-based Modeling for Performance Tuning of Spark Data Analytics

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

A Lightweight I/O Scheme to Facilitate Spatial and Temporal Queries of Scientific Data Analytics.

FITS Data Source for Apache Spark

Sdac: Porting Scientific Data To Spark Rdds

A high performance web-based system for analyzing and visualizing spatiotemporal data for climate studies

Acceleration strategy of feature extraction based on remote sensing big data in Spark

SCASA: A Spark-Based Parallel Approach for Net Primary Productivity Calculation with CASA Model

A Framework of Distributed Spatial Data Analysis Based on Shark/Spark