Abstract: Large-scale machine learning methods are becoming ubiquitous in applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of operations, e.g., relational operations for pre-processing or post-processing the dataset, and matrix operations for core model computations. Many existing systems focus on efficiently processing matrix-only operations, and assume that the inputs to the relational operators are pre-computed and materialized as intermediate matrices. However, the input to a relational operator in a machine learning pipeline may be complex, and may involve various combinations of matrix operators. Hence, it is critical to realize scalable and efficient relational query processors that operate directly on big matrix data. This paper presents new efficient and scalable relational query processing techniques on big matrix data for in-memory distributed clusters. The proposed techniques leverage algebraic transformation rules to rewrite query execution plans into ones with lower computation costs. A distributed query plan optimizer exploits the sparsity-inducing property of merge functions as well as Bloom join strategies to efficiently evaluate various flavors of the join operation. Furthermore, optimized partitioning schemes for the input matrices are developed to speed up join operations, based on a cost model that minimizes the communication overhead. The proposed relational query processing techniques are prototyped in Apache Spark. Experiments on both real and synthetic data demonstrate that the proposed techniques achieve up to two orders of magnitude performance improvement over state-of-the-art systems on a wide range of applications.
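As a rough illustration of the Bloom join idea mentioned in the abstract, the following single-process Python sketch joins two sparse matrices on equal row index. It is not the paper's Spark implementation: the names `BloomFilter` and `bloom_join_on_row`, the dictionary-of-entries matrix layout, and the filter sizes are all assumptions made for this example. The point it shows is that a compact filter built from the smaller input can discard most non-matching entries of the larger input before they would need to be shuffled, while an exact join afterwards removes any false positives.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (hypothetical helper, not from the paper)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_join_on_row(small, large, merge):
    """Join sparse matrices {(row, col): value} on equal row index.

    Returns the joined entries and the number of entries of `large`
    that survived the Bloom prefilter.
    """
    # 1. Summarize the join keys (row indices) of the smaller input.
    bloom = BloomFilter()
    for row, _col in small:
        bloom.add(row)

    # 2. Prefilter the larger input; in a cluster, only these
    #    candidates would have to be sent over the network.
    candidates = {k: v for k, v in large.items() if bloom.might_contain(k[0])}

    # 3. Exact hash join on the candidates removes false positives.
    by_row = {}
    for (row, col), val in small.items():
        by_row.setdefault(row, []).append((col, val))

    result = {}
    for (row, col_l), val_l in candidates.items():
        for col_s, val_s in by_row.get(row, []):
            result[(row, col_s, col_l)] = merge(val_s, val_l)
    return result, len(candidates)


# Example: a 2-entry matrix joined with a 100-entry matrix.
small = {(0, 0): 2.0, (3, 1): 5.0}
large = {(r, 0): float(r + 1) for r in range(100)}
joined, shipped = bloom_join_on_row(small, large, lambda a, b: a * b)
print(joined)
print(f"{shipped} of {len(large)} entries passed the Bloom prefilter")
```

The merge function here is plain multiplication, chosen only for the example. The number of surviving candidates depends on the filter's false-positive rate, so it is at least the number of true matches but may be slightly higher.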

Index-Based OLAP Aggregation for In-Memory Cluster Computing

Towards the Building of a Dense-Region-based OLAP System

Data Warehouse Native Feature Based OLAP Querying with Keywords

A Multidimensional OLAP Engine Implementation in Key-Value Database Systems

NUMA-Aware Scalable and Efficient In-Memory Aggregation on Large Domains

Requirement-Based Data Cube Schema Design

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

AQP++: Connecting Approximate Query Processing with Aggregate Precomputation for Interactive Analytics

In-Memory Indexed Caching for Distributed Data Processing

Regression Cubes with Lossless Compression and Aggregation

Revisiting Aggregation for Data Intensive Applications: A Performance Study

A Practice of TPC-DS Multidimensional Implementation on NoSQL Database Systems

R-Store: A scalable distributed system for supporting real-time analytics

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

Data Compression for Analytics over Large-scale In-memory Column Databases

Scalable Relational Query Processing on Big Matrix Data

Data Caching for Enterprise-Grade Petabyte-Scale OLAP

Accelerating Relational Database Analytical Processing with Bulk-Bitwise Processing-in-Memory

Cool: a COhort OnLine analytical processing system