Abstract: The amount of data managed in today's Cloud systems has reached an unprecedented scale. An effective way to speed up query processing is to build indexes on the attributes used in query predicates. However, conventional indexing schemes fail to provide a scalable service: because the size of these indexes is proportional to the data size, building many indexes is not space efficient. It therefore becomes crucial to develop effective indexes that provide scalable database services in the Cloud. In this paper, we propose a compact bitmap indexing scheme for a large-scale data store. The bitmap indexing scheme combines state-of-the-art bitmap compression techniques, such as WAH encoding and bit-sliced encoding. To further reduce the index cost, we adopt a novel and query-efficient partial indexing technique, which dynamically refreshes the index to handle updates and process queries. The intuition behind our indexing approach is to maximize the number of indexed attributes, so that a wider range of queries, including range and join queries, can be efficiently supported. Our indexing scheme is lightweight, and its creation can be seamlessly grafted onto the MapReduce processing engine without incurring significant running cost. Moreover, its compactness allows us to maintain the bitmap indexes in memory, so that the performance overhead of index access is minimal. We implement our indexing scheme on top of the underlying Distributed File System (DFS) and evaluate its performance on an in-house cluster. We compare our index-based query processing with HadoopDB to show its superior performance. Our experimental results confirm the effectiveness, efficiency and scalability of the indexing scheme.
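
To make the bit-sliced encoding mentioned in the abstract more concrete, the following is a minimal Python sketch of a bit-sliced bitmap index over a small integer column. It is an illustration under stated assumptions, not the implementation described in the paper: the function names (`build_bit_sliced_index`, `rows_equal`, `rows_less_than`, `bitmap_to_rows`) are hypothetical, Python integers stand in for bitmaps, values are assumed to be non-negative and to fit in `num_bits` bits, and WAH compression, partial indexing and MapReduce integration are all omitted.

```python
def build_bit_sliced_index(values, num_bits):
    """Build one bitmap per bit position (hypothetical helper).

    slices[i] has bit r set when bit i of values[r] is 1.
    Assumes every value is a non-negative integer below 2**num_bits.
    """
    slices = [0] * num_bits
    for row, value in enumerate(values):
        for i in range(num_bits):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


def rows_equal(slices, num_rows, target):
    """Return a bitmap of rows whose value equals target."""
    all_rows = (1 << num_rows) - 1
    result = all_rows
    for i, s in enumerate(slices):
        if (target >> i) & 1:
            result &= s              # row must have bit i set
        else:
            result &= ~s & all_rows  # row must have bit i clear
    return result


def rows_less_than(slices, num_rows, bound):
    """Return a bitmap of rows whose value is strictly below bound.

    Scans slices from the most significant bit down.
    Assumes bound < 2**len(slices); higher bits of bound are ignored.
    """
    all_rows = (1 << num_rows) - 1
    lt = 0          # rows already known to be smaller than bound
    eq = all_rows   # rows that match bound on all bits seen so far
    for i in reversed(range(len(slices))):
        s = slices[i]
        if (bound >> i) & 1:
            lt |= eq & ~s & all_rows  # bound has 1 here, row has 0: smaller
            eq &= s
        else:
            eq &= ~s & all_rows       # bound has 0 here, row with 1 is larger
    return lt


def bitmap_to_rows(bitmap):
    """Expand a bitmap into a sorted list of row numbers."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


values = [5, 3, 7, 3, 0, 6]
index = build_bit_sliced_index(values, num_bits=3)
equal_rows = bitmap_to_rows(rows_equal(index, len(values), 3))
range_rows = bitmap_to_rows(rows_less_than(index, len(values), 5))
```

In this sketch the index needs only `num_bits` bitmaps regardless of how many distinct values the column holds, and both the equality and the range predicate reduce to bitwise operations over those bitmaps. That is the general property that makes bit-sliced encoding attractive for compact, memory-resident indexes.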

In-Memory Indexed Caching for Distributed Data Processing

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

TSCache

Adaptive Indexing for Distributed Array Processing

An Efficient and Compact Indexing Scheme for Large-Scale Data Store.

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

An Improved Memory Cache Management Study Based on Spark

Efficient B-tree Based Indexing for Cloud Data Processing.

Data Object Cache in Spark Computing Engine

Index-Based OLAP Aggregation for In-Memory Cluster Computing

Adaptive memory reservation strategy for heavy workloads in the Spark environment

Practical Near-Data Processing for In-Memory Analytics Frameworks

Pangea: Monolithic Distributed Storage for Data Analytics

Memory optimization of Spark parallel computing framework

SAC: Dynamic Caching Upon Sketch for In-Memory Big Data Analytics

Improving Spark Performance with Zero-Copy Buffer Management and RDMA

A Scalable Learned Index Scheme in Storage Systems

MicroStream: A Distributed In-memory Caching Service for Data Production

Agile-Ant: Self-Managing Distributed Cache Management for Cost Optimization of Big Data Applications

In-memory big data analytics under space constraints using dynamic programming.