Visualization And Diagnosis Of Earth Science Data Through Hadoop And Spark

Shujia Zhou, Xiaowen Li, Toshihisa Matsui, Weikuo Tao
DOI: https://doi.org/10.1109/BigData.2016.7840949
2016-01-01
Abstract: Ultra-high-resolution Earth science simulations run over long periods produce large data sets (over a terabyte). Distributing and analyzing such data in an effective, efficient, and scalable way is a challenge. One key reason is that typical Earth science data are stored in NetCDF, which is not supported by the popular and powerful Hadoop Distributed File System (HDFS) and consequently cannot be analyzed with HDFS-based tools. In this paper, we report a system for visualizing and analyzing Earth science data based on Hadoop and Spark. It transforms data from NetCDF to CSV (comma-separated values), which is supported by HDFS, and indexes the data appropriately, both to save storage space and to allow flexible manipulation through Hive, Impala, and SparkSQL. Adaptive subsetting and visualization of cloud-resolving model simulation data are used to validate and demonstrate the features of this system.
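The abstract does not spell out the indexing scheme, so the following is only a minimal sketch of one way a gridded field could be flattened to CSV with a compact index. It assumes a small in-memory [time][lat][lon] nested list stands in for a NetCDF variable, and that a single row-major integer index replaces separate time, latitude, and longitude columns to save space. The function names (grid_to_csv_rows, decode_index, rows_to_csv) and the column names idx and value are hypothetical and are not taken from the paper.

```python
import csv
import io


def grid_to_csv_rows(grid):
    """Flatten a [time][lat][lon] nested list into (flat_index, value) rows.

    The flat index is row-major: (t * nlat + j) * nlon + i. This is an
    assumed scheme for illustration, not the one used in the paper.
    """
    nlat = len(grid[0])
    nlon = len(grid[0][0])
    rows = []
    for t, plane in enumerate(grid):
        for j, line in enumerate(plane):
            for i, value in enumerate(line):
                rows.append(((t * nlat + j) * nlon + i, value))
    return rows


def decode_index(flat_index, nlat, nlon):
    """Recover (t, j, i) grid coordinates from a row-major flat index."""
    t, rem = divmod(flat_index, nlat * nlon)
    j, i = divmod(rem, nlon)
    return t, j, i


def rows_to_csv(rows):
    """Serialize (flat_index, value) rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["idx", "value"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical 2 x 2 x 3 field (2 time steps, 2 latitudes, 3 longitudes).
grid = [
    [[280.1, 280.4, 280.9], [281.0, 281.2, 281.5]],
    [[279.8, 280.0, 280.3], [280.6, 280.9, 281.1]],
]
rows = grid_to_csv_rows(grid)
csv_text = rows_to_csv(rows)
```

With a layout like this, a subset over a contiguous block of time steps reduces to a range filter on idx, which is the kind of query a SQL engine over CSV files can express directly. Whether the system in the paper works this way is not stated in the abstract.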