Hybrid storage architecture and efficient MapReduce processing for unstructured data
Weiming Lu, Yaoguang Wang, Jingyuan Jiang, Jian Liu, Yapeng Shen, Baogang Wei
DOI: https://doi.org/10.1016/j.parco.2017.08.008
IF: 0.983
2017-11-01
Parallel Computing
Abstract: As we enter the era of data deluge, efficiently managing massive volumes of data is becoming a great challenge, especially for exponentially growing unstructured data, whose volume far exceeds that of structured and semi-structured data. Unstructured data is also more complex because of its variety: different types of unstructured data differ in file size, type, and usage, and therefore need different storage and processing approaches to be handled efficiently. In this paper, we propose a hybrid storage architecture for storing pervasive unstructured data. The architecture integrates various kinds of data stores within a unified framework, in which each type of unstructured data is assigned a suitable placement policy transparently to users. In addition, we present several partitioning strategies based on the unified framework that benefit MapReduce-based batch processing of unstructured data. Experiments demonstrate that the hybrid architecture and the partitioning strategies make it possible to build an efficient and smart system.
computer science, theory & methods