Abstract: As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory access and data sharing are becoming performance bottlenecks. Cloud computing employs a data processing paradigm typically built on a loosely connected group of low-cost computing nodes that does not rely on shared storage or shared memory. Apache Spark is a popular engine for large-scale data analysis in the cloud, which we have successfully deployed via job submission scripts on production clusters. In this paper, we describe common parallel analysis dataflows for both Message Passing Interface (MPI) and cloud-based applications. We developed a benchmark to measure the performance characteristics of these tasks on both types of systems, specifically comparing MPI/C-based analyses with Spark. The benchmark is a data processing pipeline representative of a typical analytics framework implemented using map-reduce. In the case of Spark, we also consider whether language plays a role by writing tests in both Python and Scala, a language built on the Java Virtual Machine (JVM). We include performance results from two large systems at Argonne National Laboratory, including Theta, a Cray XC40 supercomputer on which our experiments run with 65,536 cores (1,024 nodes with 64 cores each). The results of our experiments are discussed in the context of their applicability to future HPC architectures. Beyond understanding performance, our work demonstrates that technologies such as Spark, while typically aimed at multi-tenant cloud-based environments, show promise for data analysis needs in a traditional clustering/supercomputing environment.
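
To make the map-reduce dataflow mentioned in the abstract concrete, the following is a minimal single-process sketch in plain Python. It is not the benchmark described in the paper: the function names (`map_phase`, `reduce_phase`, `run_pipeline`), the squaring transform, and the modulo-4 keying are all hypothetical choices made for illustration. Partitions stand in for the nodes across which MPI or Spark would distribute the data.

```python
from functools import reduce


def map_phase(records):
    # Map: turn each record into a (key, value) pair independently of the others.
    # The key (r % 4) and value (r squared) are illustrative assumptions.
    return [(r % 4, float(r) ** 2) for r in records]


def reduce_phase(pairs):
    # Reduce: sum all values that share a key.
    def combine(acc, kv):
        key, value = kv
        acc[key] = acc.get(key, 0.0) + value
        return acc

    return reduce(combine, pairs, {})


def merge(left, right):
    # Combine two partial results, as a final reduction across nodes would.
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0.0) + value
    return merged


def run_pipeline(records, num_partitions=4):
    # Split the input into partitions that stand in for compute nodes.
    partitions = [records[i::num_partitions] for i in range(num_partitions)]
    # Each partition is mapped and reduced locally before the global merge.
    partials = [reduce_phase(map_phase(p)) for p in partitions]
    return reduce(merge, partials, {})


if __name__ == "__main__":
    print(run_pipeline(list(range(8))))
```

Because the per-key sum is associative and commutative, the result does not depend on how the records are partitioned, which is the property that lets the same pipeline be expressed with either MPI reductions or Spark transformations.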
