Distributed XPath Query Processing over Large XML Data Based on MapReduce Framework

Hongjie Fan, Dongsheng Wang, Junfei Liu
DOI: https://doi.org/10.1109/fskd.2016.7603390
2016-01-01
Abstract: The volume of XML data is very large in many domains, especially data logging and scientific applications. XPath querying is the core operation of XML processing, and querying massive XML data stored in a distributed manner is a challenge. In this paper, we present an efficient distributed XPath query processing method based on MapReduce, which simultaneously processes queries over a massive volume of XML data. We first use virtual nodes to split a large-scale XML data file into file splits that are placed in the distributed storage system. We then present a distributed XPath query algorithm that evaluates different fragments of the document tree in parallel using the MapReduce framework. Furthermore, to handle large XML data efficiently, we build a partitional index and use a random access mechanism to perform the query. Experiments show that our approach is efficient and scalable.
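The split, map, and reduce steps described in the abstract can be illustrated with a small single-machine sketch. This is not the authors' algorithm: it assumes splits are taken only at the boundaries of the root's child elements, it wraps each split in a copy of the root element as a simple stand-in for a virtual node, it uses a thread pool in place of a MapReduce cluster, and it supports only the XPath subset of Python's `xml.etree.ElementTree`. The partitional index and random access mechanism are not modeled. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def split_with_virtual_root(xml_text, chunk_size):
    """Split the root's children into well-formed fragments.

    Each fragment is wrapped in a copy of the original root element
    (an assumed, simplified form of a "virtual node").
    Returns a list of (split_id, fragment_text) pairs.
    """
    root = ET.fromstring(xml_text)
    children = list(root)
    fragments = []
    for start in range(0, len(children), chunk_size):
        virtual_root = ET.Element(root.tag, root.attrib)
        virtual_root.extend(children[start:start + chunk_size])
        text = ET.tostring(virtual_root, encoding="unicode")
        fragments.append((len(fragments), text))
    return fragments


def map_query(fragment, path):
    """Map step: evaluate the path on one fragment, keyed by split id."""
    split_id, text = fragment
    root = ET.fromstring(text)
    matches = [ET.tostring(e, encoding="unicode").strip()
               for e in root.findall(path)]
    return split_id, matches


def reduce_results(mapped):
    """Reduce step: merge per-fragment matches in document order."""
    merged = []
    for _, matches in sorted(mapped, key=lambda pair: pair[0]):
        merged.extend(matches)
    return merged


def distributed_query(xml_text, path, chunk_size=2):
    """Run split, map (in parallel threads), and reduce."""
    fragments = split_with_virtual_root(xml_text, chunk_size)
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(lambda f: map_query(f, path), fragments))
    return reduce_results(mapped)
```

Because every fragment keeps the original root tag, a path written relative to the document root (for example `./entry/msg`) can be evaluated unchanged on each fragment. Paths whose matches depend on siblings in other fragments would need extra handling that this sketch omits.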