Abstract: Data warehouse systems such as Apache Hive are widely used in distributed computing. However, current-generation data warehouse systems have not fully embraced High Performance Computing (HPC) technologies, even though Big Data and HPC are beginning to converge. For example, in the traditional HPC field, Message Passing Interface (MPI) libraries have been optimized over the last decades to deliver ultra-high data movement performance for HPC applications. Recent studies, such as DataMPI, extend MPI to Big Data applications to bridge the two fields. This trend motivates us to explore whether MPI can benefit data warehouse systems such as Apache Hive. In this paper, we propose a novel design that accelerates Apache Hive by utilizing DataMPI. We further optimize the DataMPI engine by introducing enhanced non-blocking communication and parallelism mechanisms for typical Hive workloads, based on their communication characteristics. Our design fully and transparently supports Hive workloads such as Intel HiBench and TPC-H with high productivity. Performance evaluation with Intel HiBench shows that, thanks to the lightweight DataMPI library design and efficient job start-up and data movement mechanisms, Hive on DataMPI performs 30% faster than Hive on Hadoop on average. Experiments on TPC-H with ORCFile show that Hive on DataMPI improves performance over Hive on Hadoop by 32% on average and by up to 53%. To the best of our knowledge, Hive on DataMPI is the first attempt to propose a general design for fully supporting and accelerating data warehouse systems with MPI.

HDW: A High Performance Large Scale Data Warehouse

Parallel Data Warehouses Architecture Based on PC Cluster

Accelerating Apache Hive with MPI for Data Warehouse Systems

Architecture Design of Wide-area Distributed Real-time Database in Grid Dispatching Control System

Hierarchically Distributed Data Warehouse

Design Of Distributed Data Warehouse For Sales Decision Of Large-Scale Clothing Enterprise

Research on storage and query of large-scale multidimensional data

H-DB: Yet Another Big Data Hybrid System of Hadoop and DBMS

HBaseSpatial: A Scalable Spatial Data Storage Based on HBase

A High Performance Query Analytical Framework for Supporting Data-Intensive Climate Studies

A Proposal of High Performance Data Mining System

HyDB: a High Effective SaaS Architecture by Integrating MapReduce and Database

On Improving GDSS Data Warehouse

G-Hadoop: MapReduce across distributed data centers for data-intensive computing

Efficient Query Processing Framework for Big Data Warehouse: an Almost Join-Free Approach

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

HyDB: Access Optimization for Data-Intensive Service

Designing and Implementing Data Warehouse for Agricultural Big Data

A High Performance Hierarchical Cubing Algorithm And Efficient Olap In High-Dimensional Data Warehouse

Design and Construction of a Big Data Analytics Framework for Health Applications

GVDS: a Global Virtual Data Space for Wide-Area High-Performance Computing Environments