Abstract: We study the problem of optimizing data storage and access costs on the cloud while ensuring that the desired performance or latency is unaffected. First, we propose an optimizer that selects the cloud storage tier and the compression scheme for each data partition, given temporal access predictions for that partition. Second, we propose a model that learns the compression performance of multiple algorithms across data partitions in different formats and generates compression performance predictions on the fly as inputs to the optimizer. Third, we approach the data partitioning problem in a fundamentally different way from the current default in most data lakes, where partitions correspond to ingestion batches: we propose access-pattern-aware data partitioning and formulate an optimization problem that optimizes the size and reading costs of partitions subject to access patterns. We study these optimization problems both theoretically and empirically, and provide theoretical bounds as well as hardness results. We propose a unified cost-minimization pipeline, called SCOPe, that combines the different modules. We extensively compare our methods with related baselines from the literature on TPC-H data as well as enterprise datasets (ranging from GB to PB in volume) and show that SCOPe substantially improves over the baselines. Compared to platform baselines, we show cost savings on the order of 50% to 83% on enterprise Data Lake datasets that range from terabytes to petabytes in volume.
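To make the tier-and-compression choice concrete, the following is a minimal sketch of the kind of decision such an optimizer makes for a single partition. It is not the SCOPe formulation: the exhaustive search, the cost model, and every name and number in it (`Tier`, `Codec`, `best_plan`, the prices, latencies, and compression ratios) are illustrative assumptions rather than values from the paper or from any cloud provider.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Tier:
    """Hypothetical storage tier; all prices and latencies are made up."""
    name: str
    storage_cost_per_gb: float  # cost per GB stored per month
    read_cost_per_gb: float     # cost per GB read
    latency_ms: float           # fixed access latency


@dataclass(frozen=True)
class Codec:
    """Hypothetical compression scheme with an assumed predicted performance."""
    name: str
    ratio: float                 # compressed size / original size
    decompress_ms_per_gb: float  # decompression time per original GB


# Illustrative candidates only (assumed values).
TIERS = [Tier("hot", 0.02, 0.0, 10.0), Tier("cold", 0.004, 0.01, 100.0)]
CODECS = [Codec("none", 1.0, 0.0), Codec("compressed", 0.5, 20.0)]


def best_plan(size_gb, reads_per_month, max_latency_ms, tiers, codecs):
    """Return (monthly_cost, tier_name, codec_name) for the cheapest
    feasible pair, or None if no pair meets the latency bound."""
    best = None
    for tier, codec in product(tiers, codecs):
        # Latency = tier access latency + time to decompress the partition.
        latency = tier.latency_ms + codec.decompress_ms_per_gb * size_gb
        if latency > max_latency_ms:
            continue  # violates the performance constraint
        stored_gb = size_gb * codec.ratio
        # Monthly cost = storage + predicted reads of the stored bytes.
        cost = stored_gb * (
            tier.storage_cost_per_gb + reads_per_month * tier.read_cost_per_gb
        )
        if best is None or cost < best[0]:
            best = (cost, tier.name, codec.name)
    return best


# A rarely read partition with a loose latency bound.
print(best_plan(10, 0, 1000, TIERS, CODECS))
# A frequently read partition with a tight latency bound.
print(best_plan(10, 100, 50, TIERS, CODECS))
```

Under these assumed numbers, the rarely read partition lands on the cheaper tier with compression, while the tight latency bound forces the frequently read partition onto the faster tier without compression. In a real setting the access counts and compression ratios would be predictions, as the abstract describes, and the search would run jointly over many partitions rather than one at a time.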

Corra: Correlation-Aware Column Compression

LeCo: Lightweight Compression Via Learning Serial Correlations

Lightweight Correlation-Aware Table Compression

SortComp (Sort-and-compress) - Towards a Universal Lossless Compression Scheme for Matrix and Tabular Data

An Empirical Evaluation of Columnar Storage Formats

Compressed Sensing Performance of Binary Matrices with Binary Column Correlations

Towards Optimizing Storage Costs on the Cloud

Cowic: A Column-Wise Independent Compression for Log Stream Analysis

Regression Cubes with Lossless Compression and Aggregation

Value-Compressed Sparse Column (VCSC): Sparse Matrix Storage for Redundant Data

Exploring Lossy Compressibility through Statistical Correlations of Scientific Datasets

Data Compression for Analytics over Large-scale In-memory Column Databases

Leveraging Spatial and Temporal Correlations for Network Traffic Compression

Efficient Semantic Matching with Hypercolumn Correlation

NULLS!: Revisiting Null Representation in Modern Columnar Formats

The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models

LDPC Code-Based Distributed Source Coding With an Efficient Message Passing Mechanism for the Compression of Correlated Image Sources

DEC: An Efficient Deduplication-Enhanced Compression Approach

Fine-Grained Correlation Representation for Graph-Based Point Cloud Attribute Compression