HiBase: A Hierarchical Indexing Mechanism and System for Efficient HBase Query

Wei GE, Sheng-Mei LUO, Wen-Hui ZHOU, Di ZHAO, Yun TANG, Juan ZHOU, Wen-Wu QU, Chun-Feng YUAN, Yi-Hua HUANG
DOI: https://doi.org/10.11897/SP.J.1016.2016.00140
2016-01-01
Abstract: We have entered the big data era, and the amount of data is growing explosively in many business areas. There is an urgent need for efficient storage and management of big data to support realtime or near-realtime queries for data analysis. Hadoop HBase provides a highly scalable method and system for storing and querying big data. However, HBase indexes only the row key and does not support indexing on non-key columns, which makes it insufficient for realtime or near-realtime applications. In this paper, we propose a hierarchical secondary indexing model and method for HBase. It builds a permanent layer of secondary indexes on the non-key columns of an HBase table to speed up queries. Furthermore, we present the Hotscore Algorithm, which combines a hot-index cache mechanism with an efficient cache replacement policy to reduce the disk access overhead for index data. The Hotscore Algorithm overcomes the limitations of the Least Recently Used (LRU) policy. To differentiate hot and cold index data more precisely and to fit the temporal locality of data accesses, it accumulates the access frequency of each record and reduces the accumulated value exponentially at periodic intervals. Additionally, we design a distributed memory cache protocol based on consistent hashing to give the hot-index cache layer excellent scalability. Finally, we implement a hierarchical indexing system, HiBase. Experimental results on datasets ranging from 10 million to one billion records show that the HiBase cold query (a query that misses the cache) outperforms standard HBase by 65 times (for large result sets) to more than 3,000 times (for small result sets). Further, the HiBase hot query (a query that hits the cache), using the Hotscore Algorithm as the cache replacement policy, achieves an extra 5 to 15 times speedup over the HiBase cold query. This brings the overall speedup to more than 300 times (for large result sets) and up to 17,000 times (for small result sets) compared with standard HBase, and to 5 to 20 times compared with the open-source Hindex secondary indexing system.
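The abstract describes the Hotscore replacement policy only in outline: each record accumulates an access count, and that count is reduced exponentially at periodic intervals. The following is a minimal Python sketch of that idea, not the authors' implementation. The class name `HotscoreCache`, the `decay` factor of 0.5, the explicit `tick()` call standing in for the periodic timer, and the evict-the-lowest-score rule are all assumptions made for illustration.

```python
class HotscoreCache:
    """Illustrative cache that evicts the entry with the lowest decayed score.

    All names and parameters here are hypothetical; the abstract does not
    give the exact scoring formula or decay period.
    """

    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity  # maximum number of cached index entries
        self.decay = decay        # assumed multiplicative decay per period
        self.values = {}          # key -> cached index data
        self.scores = {}          # key -> accumulated, decayed access score

    def get(self, key):
        """Return the cached value and count the access, or None on a miss."""
        if key in self.values:
            self.scores[key] += 1.0
            return self.values[key]
        return None

    def put(self, key, value):
        """Insert a value, evicting the coldest entry when the cache is full."""
        if key not in self.values and len(self.values) >= self.capacity:
            coldest = min(self.scores, key=self.scores.get)
            del self.values[coldest]
            del self.scores[coldest]
        self.values[key] = value
        self.scores[key] = self.scores.get(key, 0.0) + 1.0

    def tick(self):
        """Apply one period of exponential decay to every score."""
        for key in self.scores:
            self.scores[key] *= self.decay


cache = HotscoreCache(capacity=2)
cache.put("a", "index-a")
for _ in range(3):
    cache.get("a")          # "a" is accessed often
cache.put("b", "index-b")   # "b" is the most recently used entry
cache.tick()                # one decay period passes
cache.put("c", "index-c")   # evicts "b", the entry with the lowest score
```

In this sketch a plain LRU policy would have evicted "a", because "b" was touched more recently. Scoring by decayed frequency keeps the frequently accessed entry, while the periodic decay lets entries that were hot long ago eventually cool down and become eviction candidates.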