Abstract: Sustainability research faces many challenges because environmental, urban, and regional contexts are changing rapidly at an unprecedentedly fine spatial granularity, which produces massive, growing datasets and requires spatial relationships to be detected at a faster pace. Spatial join is a fundamental method for making data more informative with respect to spatial relations. The dramatic growth of data volumes has led to an increased focus on high-performance, large-scale spatial join. In this paper, we present Spatial Join with Spark (SJS), a high-performance algorithm that uses a simple but efficient uniform spatial grid to partition datasets and joins the partitions with the built-in join transformation of Spark. SJS exploits the distributed in-memory iterative computation of Spark and introduces a calculation-evaluating model and an in-memory spatial repartition technique, which optimize the initial partition by evaluating the calculation amount of the local join algorithms without any disk access. We compare four in-memory spatial join algorithms in SJS for further performance improvement. Based on extensive experiments with real-world data, we conclude that SJS outperforms the Spark and MapReduce implementations of earlier spatial join approaches. This study demonstrates that leveraging high-performance computing for large-scale spatial join analysis is promising. The availability of large geo-referenced datasets, together with high-performance computing technology, creates great opportunities for sustainability research to examine whether and how these new trends in data and technology can help detect the associated trends and patterns in human-environment dynamics.
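The grid-partitioned join described in the abstract can be illustrated with a minimal single-machine sketch. It is not the SJS implementation: it uses plain Python dictionaries in place of Spark, omits the calculation-evaluating model and the in-memory repartition step, and uses a nested-loop local join with set-based duplicate removal, all of which are assumptions made for illustration. The names `grid_cells`, `partition`, `intersects`, and `grid_spatial_join`, and the `(id, xmin, ymin, xmax, ymax)` rectangle layout, are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Each object is represented by its bounding rectangle:
# (id, xmin, ymin, xmax, ymax). This layout is an assumption of the sketch.

def grid_cells(rect, cell_size):
    """Return every uniform-grid cell (col, row) that the rectangle overlaps."""
    _, xmin, ymin, xmax, ymax = rect
    cols = range(int(xmin // cell_size), int(xmax // cell_size) + 1)
    rows = range(int(ymin // cell_size), int(ymax // cell_size) + 1)
    return product(cols, rows)

def partition(rects, cell_size):
    """Assign each rectangle to all grid cells it overlaps (it may be replicated)."""
    cells = defaultdict(list)
    for rect in rects:
        for cell in grid_cells(rect, cell_size):
            cells[cell].append(rect)
    return cells

def intersects(a, b):
    """Axis-aligned rectangle intersection test (touching edges count)."""
    return a[1] <= b[3] and b[1] <= a[3] and a[2] <= b[4] and b[2] <= a[4]

def grid_spatial_join(left, right, cell_size):
    """Join two rectangle sets cell by cell and return intersecting id pairs."""
    left_parts = partition(left, cell_size)
    right_parts = partition(right, cell_size)
    results = set()  # a set removes pairs found in more than one shared cell
    # Only cells present on both sides can produce results (an equi-join on cell key).
    for cell in left_parts.keys() & right_parts.keys():
        for a in left_parts[cell]:        # simple nested-loop local join
            for b in right_parts[cell]:
                if intersects(a, b):
                    results.add((a[0], b[0]))
    return results

left = [("a", 0, 0, 2, 2), ("b", 5, 5, 6, 6)]
right = [("x", 1, 1, 3, 3), ("y", 8, 8, 9, 9), ("z", 5.5, 0, 7, 5.5)]
pairs = grid_spatial_join(left, right, cell_size=2)
```

In a Spark setting, the cell key would play the role of the join key for the built-in join transformation, and the inner loops correspond to the local join within each partition, which is where the abstract's comparison of in-memory spatial join algorithms would apply.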

Research on Join Operation of Temporal Big Data in Distributed Environment

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

Distributed High-Dimension Matrix Operation Optimization on Spark

Performance Evaluation for Distributed Join Based on MapReduce

Distributed In-Memory Analytics For Big Temporal Data

ITISS: an Efficient Framework for Querying Big Temporal Data

Research on Temporal Query Expansion and Temporal Index Optimization Based on Spark

The Research of Distributed Shared Memory Technology in Power System

An Efficient Theta-Join Query Processing in Distributed Environment

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

A New Design of High-Performance Large-Scale GIS Computing at a Finer Spatial Granularity: A Case Study of Spatial Join with Spark for Sustainability

A Study of Performance Optimization Method for Massive Spatio-temporal Data Based on Spatio-temporal Partition Clustering

Distributed scheduling and storage scheme based on LSM-OCTree for spatiotemporal stream

A Distributed Join Algorithm on Separated Data Storage

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

Distributed Spatio-Temporal K Nearest Neighbors Join

A Kind Of New Join Query Method Based On Spatio-Temporal Database

Distributed Top-K Join Queries Optimizing for RDF Datasets

Skyline-Join in Distributed Databases

RelJoin: Relative-cost-based Selection of Distributed Join Methods for Query Plan Optimization

A Survey of Spatio-Temporal Big Data Indexing Methods in Distributed Environment