Abstract: The big data field is expanding rapidly, with a growing number of workloads and runtime systems, which poses a serious challenge to workload characterization, the foundation of innovative system and architecture design. Previous major efforts on big data benchmarking either propose a comprehensive but very large set of workloads, or select only a few workloads according to so-called popularity, which may lead to partial or even biased observations. In this paper, on the basis of a comprehensive big data benchmark suite, BigDataBench, we reduce 77 workloads to 17 representative workloads from a micro-architectural perspective. On a typical state-of-practice platform, the Intel Xeon E5645, we compare the representative big data workloads with SPECINT, SPECFP, PARSEC, CloudSuite and HPCC. After a comprehensive workload characterization, we make the following observations. First, big data workloads are dominated by data movement and have more branch operations, with the two together accounting for up to 92% of the instruction mix, which places them in a different class from desktop (SPEC CPU2006), CMP (PARSEC), and HPC (HPCC) workloads. Second, corroborating previous work, Hadoop- and Spark-based big data workloads have higher front-end stalls. Compared with traditional workloads, e.g. PARSEC, the big data workloads have larger instruction footprints. We also note that, in addition to varied instruction-level parallelism, front-end efficiency differs significantly among big data workloads. Third, we find that complex software stacks that fail to use state-of-practice processors efficiently are one of the main factors leading to high front-end stalls. For the same workloads, L1I cache miss rates differ by an order of magnitude across implementations built on different software stacks.

Analysis of Big Data Platform with OpenStack and Hadoop

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

Design and Implementation of Clinical Data Center Based on Hadoop

DataMPI: Extending MPI to Hadoop-Like Big Data Computing

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Power Big Data Analysis Platform Design Based on Hadoop

Visual Analysis of Cloud Computing Performance Using Behavioral Lines

The performance of MapReduce: an in-depth study

Efficient Support of Big Data Storage Systems on the Cloud

The Performance of MapReduce

Performance optimization of computing task scheduling based on the Hadoop big data platform

The Design and Implementation of Geographic Information Storage System Based on the Cloud Platform

Building a Productive Domain-Specific Cloud for Big Data Processing and Analytics Service

Characterization and Architectural Implications of Big Data Workloads

Big Data Storage Architecture Design in Cloud Computing

Log analysis in cloud computing environment with Hadoop and Spark

Block Storage Optimization and Parallel Data Processing and Analysis of Product Big Data Based on the Hadoop Platform

Introduction to Harp: when Big Data Meets HPC

Survey of Distributed Computing Frameworks for Supporting Big Data Analysis

A MapReduce Cluster Deployment Optimization Framework with Geo-distributed Data

Location-Aware Data Block Allocation Strategy for HDFS-Based Applications in the Cloud