TwigStack-MR: An Approach to Distributed XML Twig Query Using MapReduce

Hongjie Fan, Han Yang, Zhiyi Ma, Junfei Liu
DOI: https://doi.org/10.1109/BigDataCongress.2016.79
2016-01-01
Abstract: Twig pattern querying is the core operation of XML processing and directly affects the efficiency of XML data queries. Manipulating massive XML data is challenging, especially on a distributed cluster, where the system must guarantee the completeness and correctness of the query results while minimizing communication costs between machines. In this paper, we present TwigStack-MR, which simultaneously processes several twig pattern queries over a massive volume of XML data using the MapReduce framework. We first split the large-scale XML data file into file splits that serve as input to the distributed storage system. We then present the distributed twig algorithm, which processes different subtrees of the document tree in parallel. Finally, we use the MapReduce framework, exploiting the characteristics of distributed environments, to process twig queries efficiently. The experimental results show that our approach is efficient and scalable.
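The three steps in the abstract (split the document, match subtrees in parallel, combine results per query) can be illustrated with a small single-process sketch. This is not the TwigStack-MR algorithm itself: the matcher below is a naive recursive one rather than a stack-based holistic join, every twig edge is assumed to be ancestor-descendant, and each split is assumed to be a complete subtree that contains the whole match. The names `match_twig`, `map_split`, `run_job`, `SPLITS`, and `QUERIES` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import product

# A twig is (tag, [child twigs]); every edge is treated as ancestor-descendant.


def match_twig(node, twig):
    """Return all bindings of `twig` rooted at `node`, each a tuple of elements."""
    tag, children = twig
    if node.tag != tag:
        return []
    per_child = []
    for child_twig in children:
        found = []
        # node.iter() also yields node itself, so skip it explicitly.
        for desc in node.iter(child_twig[0]):
            if desc is not node:
                found.extend(match_twig(desc, child_twig))
        if not found:
            return []  # one unmatched branch means no match at this node
        per_child.append(found)
    # Combine one binding per branch; with no branches this yields [(node,)].
    return [(node,) + sum(combo, ()) for combo in product(*per_child)]


def map_split(split_xml, queries):
    """Map step: evaluate every query on one split, emitting (query_id, binding)."""
    root = ET.fromstring(split_xml)
    for node in root.iter():
        for qid, twig in queries.items():
            for binding in match_twig(node, twig):
                yield qid, tuple((n.tag, (n.text or "").strip()) for n in binding)


def run_job(splits, queries):
    """Reduce step: group the emitted bindings by query id."""
    results = defaultdict(list)
    for split_xml in splits:  # each split could run on a different worker
        for qid, binding in map_split(split_xml, queries):
            results[qid].append(binding)
    return dict(results)


# Illustrative data only: three splits, each a complete subtree.
SPLITS = [
    "<book><title>XML</title><author>Ann</author><author>Bob</author></book>",
    "<book><title>Hadoop</title></book>",
    "<article><title>Twigs</title><author>Cy</author></article>",
]
QUERIES = {
    "q1": ("book", [("title", []), ("author", [])]),  # book[//title][//author]
    "q2": ("article", [("author", [])]),              # article[//author]
}
```

Calling `run_job(SPLITS, QUERIES)` evaluates both queries in a single pass over the splits. A real deployment would also have to handle matches that span split boundaries, which this sketch deliberately ignores.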