Enabling Massive XML-Based Biological Data Management in HBase
Jian Liu, Qiuru Liu, Lei Zhang, Shuhui Su, Yongzhuang Liu
DOI: https://doi.org/10.1109/TCBB.2019.2915811
2020-01-01
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract: Publishing biological data in XML formats is attractive for organizations that would like to provide their bioinformatics resources in an extensible and machine-readable format. In the era of big data, managing massive XML-based biological data has emerged as a challenging issue. As XML-based biological data sets continue to grow, traditional declarative query languages often fail to deliver efficient queries in terms of processing speed and scale. In this study, we report a novel platform to store and query massive XML-based biological data collections. A prototype tool for constructing HBase tables from XML-based biological data collections is first developed, and then a formal approach to transform the XML query model into the MapReduce query model is proposed. Finally, an evaluation of the query performance of the proposed approach on existing XML-based biological databases is presented, demonstrating the performance advantages of the proposed solution. The source code of the massive XML-based biological data management platform is freely available at https://github.com/lyotvincent/X2H.
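
The two steps named in the abstract, building table rows from XML and recasting an XML query as map and reduce phases, can be illustrated with a small in-memory sketch. This is not the X2H implementation: the sample document, the `d:` column-family prefix, and the function names `xml_to_rows`, `map_phase`, and `reduce_phase` are all assumptions made for illustration, and no HBase or Hadoop call is involved. A plain dictionary stands in for the table.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data; real biological XML schemas are far richer.
SAMPLE = """<entries>
  <entry id="P1"><name>kinase A</name><organism>human</organism></entry>
  <entry id="P2"><name>kinase B</name><organism>mouse</organism></entry>
  <entry id="P3"><name>ligase C</name><organism>human</organism></entry>
</entries>"""


def xml_to_rows(xml_text):
    """Flatten each <entry> into {row_key: {column_qualifier: value}}."""
    rows = {}
    for entry in ET.fromstring(xml_text).findall("entry"):
        cells = {}
        for child in entry:
            # Assumed layout: one column family "d", qualifier = element tag.
            cells["d:" + child.tag] = (child.text or "").strip()
        rows[entry.get("id")] = cells
    return rows


def map_phase(rows, column, value):
    """Emit (value, row_key) for every row whose column matches the predicate."""
    for key, cells in rows.items():
        if cells.get(column) == value:
            yield value, key


def reduce_phase(pairs):
    """Group the emitted row keys by their map key."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: sorted(v) for k, v in grouped.items()}


rows = xml_to_rows(SAMPLE)
# Roughly the XPath query /entries/entry[organism='human']/@id
result = reduce_phase(map_phase(rows, "d:organism", "human"))
print(result)
```

The point of the sketch is the shape of the translation: a path predicate becomes a filter in the map phase over row cells, and the result set is assembled in the reduce phase, which is what would let such a query be distributed across table regions.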