Efficient Document Analytics on Compressed Data

Feng Zhang, Jidong Zhai, Xipeng Shen, Onur Mutlu, Wenguang Chen
DOI: https://doi.org/10.14778/3236187.3236203
IF: 2.5
2018-01-01
Proceedings of the VLDB Endowment
Abstract: Today's rapidly growing document volumes pose pressing challenges to modern document analytics, in both space usage and processing time. In this work, we propose the concept of compression-based direct processing to alleviate issues in both dimensions. The main idea is to enable direct document analytics on compressed data. We present how the concept can be materialized on Sequitur, a compression algorithm that produces hierarchical grammar-like representations. We discuss the major challenges in applying the idea to various document analytics tasks, and provide a set of guidelines and supporting software modules that help developers apply compression-based direct processing effectively. Experiments show that our proposed techniques save 90.8% storage space and 77.5% memory usage, while speeding up data processing significantly, i.e., by 1.6X on sequential systems and 2.2X on distributed clusters, on average.
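To make the idea of analytics on a grammar-like representation concrete, the following is a minimal sketch of a word count computed directly on a small hierarchical grammar, without expanding it back to the original text. It is not the authors' implementation: the grammar is hand-written rather than produced by Sequitur, and the names GRAMMAR, word_counts, and expand are hypothetical. Each rule's counts are computed once and reused wherever the rule is referenced, which is where the saving over scanning the decompressed text would come from.

```python
from collections import Counter

# Hypothetical hand-written grammar (an assumption, not real Sequitur output).
# Each rule maps to a list of symbols: ints refer to other rules, strs are words.
GRAMMAR = {
    0: [1, "c", 1, "d"],  # R0 -> R1 c R1 d
    1: [2, "a", 2],       # R1 -> R2 a R2
    2: ["a", "b"],        # R2 -> a b
}


def word_counts(grammar, root=0):
    """Count words by traversing the rule DAG, visiting each rule only once."""
    memo = {}

    def counts(rule_id):
        if rule_id in memo:
            return memo[rule_id]
        total = Counter()
        for sym in grammar[rule_id]:
            if isinstance(sym, int):
                # Reuse the cached counts of the referenced rule.
                total.update(counts(sym))
            else:
                total[sym] += 1
        memo[rule_id] = total
        return total

    return counts(root)


def expand(grammar, rule_id=0):
    """Fully decompress a rule into its word sequence (used only for checking)."""
    out = []
    for sym in grammar[rule_id]:
        if isinstance(sym, int):
            out.extend(expand(grammar, sym))
        else:
            out.append(sym)
    return out


if __name__ == "__main__":
    print(word_counts(GRAMMAR))
    print(" ".join(expand(GRAMMAR)))
```

In this toy grammar, rule R1 is referenced twice by R0 and rule R2 twice by R1, yet each is counted only once; comparing the result against a count over the expanded text confirms that both approaches agree.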