Abstract: Data warehouse systems, such as Apache Hive, are widely used in distributed computing. However, current-generation data warehouse systems have not fully embraced High Performance Computing (HPC) technologies, even though Big Data and HPC are beginning to converge. For example, in the traditional HPC field, Message Passing Interface (MPI) libraries have been optimized for HPC applications over the last decades to deliver ultra-high data movement performance. Recent studies, such as DataMPI, extend MPI to Big Data applications to bridge the two fields. This trend motivates us to explore whether MPI can benefit data warehouse systems such as Apache Hive. In this paper, we propose a novel design that accelerates Apache Hive by utilizing DataMPI. We further optimize the DataMPI engine by introducing enhanced non-blocking communication and parallelism mechanisms for typical Hive workloads, based on their communication characteristics. Our design fully and transparently supports Hive workloads such as Intel HiBench and TPC-H with high productivity. Performance evaluation with Intel HiBench shows that, thanks to the light-weight DataMPI library design and efficient job start-up and data movement mechanisms, Hive on DataMPI performs 30% faster than Hive on Hadoop on average. Experiments on TPC-H with ORCFile show that Hive on DataMPI improves performance over Hive on Hadoop by 32% on average and by up to 53%. To the best of our knowledge, Hive on DataMPI is the first attempt to propose a general design for fully supporting and accelerating data warehouse systems with MPI.

Challenging SQL-on-Hadoop Performance with Apache Druid

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

Accelerating Apache Hive with MPI for Data Warehouse Systems

Query optimization for massively parallel data processing.

The performance of MapReduce: an in-depth study

The Performance of MapReduce

Evaluating the Performance of SQL*Plus with Hive for Business

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

On the performance of SQL scalable systems on Kubernetes: a comparative study

E3: an Elastic Execution Engine for Scalable Data Processing.

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Evaluating NoSQL Databases for OLAP Workloads: A Benchmarking Study of MongoDB, Redis, Kudu and ArangoDB

AQP++: Connecting Approximate Query Processing with Aggregate Precomputation for Interactive Analytics

Column-Oriented Storage Techniques for MapReduce

H-DB: Yet Another Big Data Hybrid System of Hadoop and DBMS

Beyond Batch Processing: Towards Real-Time and Streaming Big Data

Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

SciHive: Array-Based Query Processing with HiveQL

Cool: a COhort OnLine analytical processing system

Empirical Analysis on Comparing the Performance of Alpha Miner Algorithm in SQL Query Language and NoSQL Column-Oriented Databases Using Apache Phoenix

Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems