Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines

Patrick Hansert, Sebastian Michel
DOI: https://doi.org/10.14778/3681954.3682013
IF: 2.5
2024-07-01
Proceedings of the VLDB Endowment
Abstract: Data lakes deployed in the cloud are a go-to solution for enterprise data storage. While the pay-as-you-go cost model allows flexible resource allocation and billing, it mandates efficient use of resources such as CPU hours, network traffic, and storage. The distributed nature of cloud environments necessitates partitioning the data and processing these partitions separately. In this work, we put forward a practical solution to improve the efficiency of compression algorithms on Dremel-encoded data by clustering similarly structured nested data at ingestion time, such that compressible partitions can be created. We propose a clustering approach inspired by decision trees that outpaces even the naive partition-then-sort approach by up to a factor of 17.44 while also boosting compression by up to a factor of 2. We further show that sorting the individual buckets achieves a compression boost competitive with the well-established increasing-cardinality heuristic, but at a lower ingestion time.
computer science, information systems, theory & methods