Abstract: BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases (queries) that require a broad combination of data extraction techniques, including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning. However, there is currently no widespread knowledge of the resource requirements and expected performance of each query, as there is for more established benchmarks. At the same time, cloud providers offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. This study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud. It also compares popular PaaS offerings from Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc in terms of reliability, data scalability (1GB to 10TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and identifies which queries consume the most CPU, memory, and I/O. Scalability results show that most cloud providers need configuration tuning as data scale grows, especially for Spark's memory usage. These results can help practitioners quickly test systems by picking a subset of queries that stresses each of the categories. The results also show how Hive and Spark compare and what performance can be expected of each in PaaS.

Raven: Benchmarking Monetary Expense and Query Efficiency of OLAP Engines on the Cloud

Visual Analysis of Cloud Computing Performance Using Behavioral Lines

The performance of MapReduce: an in-depth study

Characterizing BigBench queries, Hive, and Spark in multi-cloud environments

The Performance of MapReduce

Saving Money for Analytical Workloads in the Cloud

Accelerating R-based Analytics on the Cloud

Monbench: A Database Performance Benchmark for Cloud Monitoring System

CASH: A Credit Aware Scheduling for Public Cloud Platforms

Cloud Server Benchmarks for Performance Evaluation of New Hardware Architecture

Towards Optimizing Storage Costs on the Cloud

Cloud Performance Modeling with Benchmark Evaluation of Elastic Scaling Strategies

Providing Scalable Database Services On The Cloud

Cost-effective Data Analytics Across Multiple Cloud Regions

PRIMEBALL: a Parallel Processing Framework Benchmark for Big Data Applications in the Cloud

A Benchmarking Framework for Interactive 3D Applications in the Cloud

Cloud BI: Future of Business Intelligence in the Cloud

Efficient B-tree Based Indexing for Cloud Data Processing

Cloud Benchmarking For Maximising Performance of Scientific Applications

Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRAD -- Extended Version