Abstract: As dataset sizes increase, data analysis tasks in high performance computing (HPC) depend increasingly on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory access and data sharing are becoming performance bottlenecks. Cloud computing employs a data processing paradigm typically built on a loosely connected group of low-cost computing nodes that does not rely on shared storage or shared memory. Apache Spark is a popular engine for large-scale data analysis in the cloud, which we have successfully deployed via job submission scripts on production clusters. In this paper, we describe common parallel analysis dataflows for both Message Passing Interface (MPI) and cloud-based applications. We developed a benchmark to measure the performance characteristics of these tasks on both types of systems, specifically comparing MPI/C-based analyses with Spark. The benchmark is a data processing pipeline representative of a typical analytics framework implemented using map-reduce. In the case of Spark, we also consider whether language plays a role by writing tests in both Python and Scala, a language built on the Java Virtual Machine (JVM). We include performance results from two large systems at Argonne National Laboratory, including Theta, a Cray XC40 supercomputer on which our experiments run with 65,536 cores (1,024 nodes with 64 cores each). The results of our experiments are discussed in the context of their applicability to future HPC architectures. Beyond understanding performance, our work demonstrates that technologies such as Spark, while typically aimed at multi-tenant cloud-based environments, show promise for data analysis needs in a traditional clustering/supercomputing environment.
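To make the map-reduce structure mentioned in the abstract concrete, the following is a minimal single-process sketch in plain Python. It is not the paper's benchmark and uses neither Spark nor MPI; the function names (`make_partitions`, `map_stage`, `reduce_stage`, `run_pipeline`) and the sum-of-squares workload are hypothetical choices made only for illustration. It shows the general pattern: data is split into partitions, each partition is mapped to a partial result independently, and the partial results are combined with an associative reduction.

```python
from functools import reduce


def make_partitions(n_items, n_parts):
    """Split the integers 0..n_items-1 into at most n_parts contiguous blocks."""
    step = max(1, -(-n_items // n_parts))  # ceiling division, at least 1
    return [list(range(i, min(i + step, n_items)))
            for i in range(0, n_items, step)]


def map_stage(block):
    """Map: compute a partial result for one partition (sum of squares, count)."""
    return (sum(x * x for x in block), len(block))


def reduce_stage(a, b):
    """Reduce: combine two partial results; the operation is associative."""
    return (a[0] + b[0], a[1] + b[1])


def run_pipeline(n_items, n_parts):
    """Run the map stage over every partition, then reduce the partial results."""
    partitions = make_partitions(n_items, n_parts)
    partials = map(map_stage, partitions)
    return reduce(reduce_stage, partials, (0, 0))


total, count = run_pipeline(1000, 8)
```

Because the reduction is associative, the result does not depend on how many partitions are used. That property is what lets a framework, whether Spark or an MPI reduction, distribute the map stage across nodes and combine partial results in any grouping.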

Parallel K-means Algorithm for Massive Texts on Spark

Distributed High-Dimension Matrix Operation Optimization on Spark

Parallel Text Categorization of Massive Text Based on Hadoop

A Parallel Implementation of the K-Means Algorithm Based on MapReduce

Study of ELM Algorithm Parallelization Based on Spark

Parallel multi-label K-nearest neighbor algorithm based on Spark

K-Means Parallel Acceleration for Sparse Data Dimensions on Flink

Parallelization of Classification Algorithms Based on SparkR

An Improved K-means Distributed Clustering Algorithm Based on Spark Parallel Computing Framework

Parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark

Parallel Graph Pattern Matching in Massive Networks Based on MapReduce

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Parallelization of Machine Learning Algorithms Respectively on Single Machine and Spark

Single Large-Scale Graph Frequent Subgraph Algorithm Based on Spark

In-Memory Parallel Processing of Massive Remotely Sensed Data Using an Apache Spark on Hadoop YARN Model

A Parallel Graph Data Analysis System Based on Spark

Evaluating Large Graph Processing in MapReduce Based on Message Passing

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

A Parallel K-means Algorithm for High Dimensional Text Data

Migrating GIS Big Data Computing from Hadoop to Spark: an Exemplary Study Using Twitter

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers