Abstract: Data processing engines increasingly leverage distributed file systems for scalable, cost-effective storage. While the Apache Parquet columnar format has become a popular choice for data storage and retrieval, Parquet files are immutable, which makes the format impractical for the frequent updates that contemporary analytical workloads demand. Log-Structured Tables (LSTs), such as Delta Lake, Apache Iceberg, and Apache Hudi, offer an alternative for scenarios requiring data mutability, balancing efficient updates against the benefits of columnar storage. They provide features like transactions, time-travel, and schema evolution, enhancing usability and enabling access from multiple engines. Moreover, engines like Apache Spark and Trino can be configured to leverage the optimizations and controls offered by LSTs to meet specific business needs. Conventional benchmarks and tools are inadequate for evaluating these changes in the storage layer, as they do not allow us to measure the impact of design and optimization choices in this new setting. In this paper, we propose a novel benchmarking approach and metrics that build upon existing benchmarks, aiming to systematically assess LSTs. We develop a framework, LST-Bench, which facilitates effective exploration and evaluation of how LSTs and data processing engines work together through tailored benchmark packages. A package is a mix of use patterns reflecting a target workload; LST-Bench makes it easy to define a wide range of use patterns and combine them into a package, and we include a baseline package for completeness. Our assessment demonstrates the effectiveness of our framework and benchmark packages in extracting valuable insights across diverse environments.
The code for LST-Bench is open-sourced and is available at https://github.com/microsoft/lst-bench/.

XTable in Action: Seamless Interoperability in Data Lakes

Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples

Robust Table Integration in Data Lakes

Integrating Data Lake Tables

Data Formats in Analytical DBMSs: Performance Trade-offs and Future Directions

Manipulating Data Lakes Intelligently With Java Annotations

One SQL to Rule Them All

TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations

Sigma Worksheet: Interactive Construction of OLAP Queries

Towards More Data-Aware Application Integration (extended version)

The Data Lakehouse: Data Warehousing and More

DIALITE: Discover, Align and Integrate Open Data Tables

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

DLToDW: Transferring Relational and NoSQL Databases from a Data Lake

Joint Management and Analysis of Textual Documents and Tabular Data within the AUDAL Data Lake

Data Lakehouse: Next Generation Information System

LST-Bench: Benchmarking Log-Structured Tables in the Cloud

In unity there is strength: Showcasing a unified big data platform with MapReduce Over both object and file storage

A Big Data Lake for Multilevel Streaming Analytics

Finding Related Tables in Data Lakes for Interactive Data Science

Oracle-Based Implementation of Extract-Transform-Load for Enterprises