Abstract: In the era of big data, Internet-based geospatial information services such as location-based service (LBS) apps are deployed everywhere, and the number of queries against massive spatial data keeps growing. As a result, traditional relational spatial databases (e.g., PostgreSQL with PostGIS, and Oracle Spatial) cannot adapt well to the needs of large-scale spatial query processing. Spark is an emerging distributed computing framework in the Hadoop ecosystem. To address the increasingly large-scale spatial query-processing requirements of the big data era, this paper proposes GeoSpark SQL, an effective framework that enables spatial queries on Spark. On the one hand, GeoSpark SQL provides a convenient SQL interface; on the other hand, it achieves both efficient storage management and high-performance parallel computing by integrating Hive and Spark. The following key issues are discussed and addressed: (1) storage management methods under the GeoSpark SQL framework, (2) the approach to implementing spatial operators in the Spark environment, and (3) spatial query optimization methods under Spark. An experimental evaluation shows that GeoSpark SQL is able to achieve real-time query processing. Spark is not a panacea, however: the traditional spatial database PostGIS/PostgreSQL performs better than GeoSpark SQL in some query scenarios, especially for spatial queries with high selectivity, such as the point query and the window query. In general, GeoSpark SQL performs better on compute-intensive spatial queries such as the kNN query and the spatial join query.

Integrating Apache Spark and External Data Sources Using Hadoop Interfaces

Design and Implementation of Clinical Data Integration and Management System Based on Hadoop Platform

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

A Survey on Spark Ecosystem for Big Data Processing

Comparative Study on MapReduce and Spark for Big Data Analytics

Design and Implementation of Real Time Data Processing System Based on Spark Streaming

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

GeoSpark SQL: an Effective Framework Enabling Spatial Queries on Spark

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

LotusSQL: SQL Engine for High-Performance Big Data Systems

Performance Analysis of Distributed Computing Frameworks for Big Data Analytics: Hadoop Vs Spark

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Data Processing Framework Using Apache and Spark Technologies in Big Data

Integrating Heterogeneous Stream and Historical Data Sources using SQL

Optimization of Spark Storage Solutions

Design and Development of a Big Data Platform for Disease Burden Based on the Spark Engine

Indexing for Large Scale Data Querying Based on Spark SQL

Framing Apache Spark in life sciences

H-DB: Yet Another Big Data Hybrid System of Hadoop and DBMS

Bioinformatics Applications on Apache Spark
