Adaptive Indexing for Distributed Array Processing

Yifeng Geng, Xiaomeng Huang, Guangwen Yang
DOI: https://doi.org/10.1109/bigdata.congress.2014.55
2014-01-01
Abstract: Scientists face a data deluge in scientific exploration. Scientific instruments and experiments collect large volumes of data, usually as multidimensional arrays stored across many files. Distributed computing techniques such as MapReduce make exploring these large datasets practical. Indexing is a well-known way to shorten query processing time, but most existing indexing methods need a full load of the raw data to build the index. In this paper, we propose a distributed adaptive indexing method for distributed array-oriented query processing. Our method does not require a full scan of the array data. For each subarray accessed by a subtask, we divide the array into multiple logical blocks of a suitable block size. When handling a query, the normal processing routine is executed; meanwhile, the index for the blocks accessed by the query is built at low cost. The whole index thus grows as queries are processed. This incremental approach exploits the data accessed by historical queries and eliminates the long load procedure. Experiments show that our adaptive indexing, implemented over Hadoop and Hive, is effective at accelerating array-oriented query processing without introducing much overhead.
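The incremental idea in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: the abstract does not say what the index stores, so the sketch assumes a per-block (min, max) summary, a one-dimensional array, and a simple "value greater than threshold" query. The class name AdaptiveBlockIndex, its methods, and the blocks_scanned counter are all hypothetical names introduced for illustration, and Hadoop and Hive are not involved.

```python
from typing import Dict, List, Tuple


class AdaptiveBlockIndex:
    """Hypothetical per-block (min, max) index that grows as queries touch blocks.

    Assumption: the index content (min/max summaries) is chosen for this
    sketch only; the abstract does not specify the index structure.
    """

    def __init__(self, data: List[float], block_size: int) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self.data = data
        self.block_size = block_size
        # block id -> (min, max). Starts empty: no up-front full scan.
        self.summary: Dict[int, Tuple[float, float]] = {}
        # Counter kept only so the effect of the index can be observed.
        self.blocks_scanned = 0

    def query(self, lo_idx: int, hi_idx: int, threshold: float) -> List[int]:
        """Return positions p in [lo_idx, hi_idx) where data[p] > threshold."""
        lo_idx = max(lo_idx, 0)
        hi_idx = min(hi_idx, len(self.data))
        hits: List[int] = []
        if hi_idx <= lo_idx:
            return hits

        first_block = lo_idx // self.block_size
        last_block = (hi_idx - 1) // self.block_size
        for block_id in range(first_block, last_block + 1):
            known = self.summary.get(block_id)
            if known is not None and known[1] <= threshold:
                # Already indexed and its maximum rules out any match: skip.
                continue

            # Normal processing routine: read the logical block.
            start = block_id * self.block_size
            end = min(start + self.block_size, len(self.data))
            block = self.data[start:end]
            self.blocks_scanned += 1

            # Side effect of the query: index the block on first access.
            if known is None:
                self.summary[block_id] = (min(block), max(block))

            for offset, value in enumerate(block):
                pos = start + offset
                if lo_idx <= pos < hi_idx and value > threshold:
                    hits.append(pos)
        return hits
```

In this sketch a block is always read in full, even when the query range covers only part of it, so that its summary is complete after the first access. Repeating a query over the same range then scans only the blocks whose recorded maximum exceeds the threshold, which is the sense in which the index "grows along with processing queries".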