Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie

Jacopo Tagliabue, Ciro Greco

2024-04-21

Abstract: As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.

Databases, Machine Learning

What problem does this paper attempt to address?

The paper addresses the reproducibility of data workloads over data lakes. As the Lakehouse architecture becomes more widespread, reproducing a data workload on the data lake has become a key challenge for data engineers. Specifically, the paper points out:

1. **Complexity of data pipelines and difficulty in debugging**: The size of data pipelines makes testing and iteration slow. At the same time, business logic is intertwined with data management, which complicates debugging and makes errors more likely.
2. **Limitations of existing tools**: Existing tools work well in isolation, but reproducing a pipeline requires time-travel across several components at once (input data, code, runtime environment, and hardware), which demands considerable engineering expertise, setup, and context switching.

To address these problems, the paper introduces a system developed at Bauplan that pairs a cloud runtime with Nessie, an open-source catalog with Git semantics. The system improves the reproducibility of data pipelines in the following ways:

- **Decoupling compute from data management**: The cloud runtime handles computation while the Nessie catalog handles data management, which simplifies debugging and reduces the likelihood of errors.
- **Providing time-travel and branching semantics**: Time-travel and branching semantics are offered on top of object storage, so users can fully reproduce a pipeline with a few CLI commands.
- **Multi-language support and modular code**: Data pipelines can be written in multiple programming languages and run directly in the cloud via the CLI (command-line interface), ensuring runtime compatibility and hardware flexibility.

In summary, the paper aims to make data workloads over data lakes reproducible by introducing a new framework that lets data scientists and engineers develop, debug, and maintain data pipelines more efficiently.
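
To make the idea of branching and time-travel over a catalog concrete, the following is a minimal in-memory sketch in Python. It is a toy model only: the `ToyCatalog` class, its method names, and the `s3://` paths are hypothetical and are not the Nessie or Bauplan API. It assumes that a catalog commit is an immutable mapping from table names to data locations, and that a branch is just a named pointer to a commit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable catalog snapshot (hypothetical model, not Nessie's)."""
    id: int
    parent: Optional[int]
    tables: Dict[str, str]  # table name -> data location in object storage


class ToyCatalog:
    """Toy catalog with Git-like branches; for illustration only."""

    def __init__(self) -> None:
        self.commits: List[Commit] = [Commit(id=0, parent=None, tables={})]
        self.branches: Dict[str, int] = {"main": 0}

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # Creating a branch copies a pointer, not the data itself.
        self.branches[name] = self.branches[from_branch]

    def write_table(self, branch: str, table: str, path: str) -> int:
        # Each write produces a new commit on top of the branch head.
        head = self.commits[self.branches[branch]]
        new = Commit(len(self.commits), head.id, {**head.tables, table: path})
        self.commits.append(new)
        self.branches[branch] = new.id
        return new.id

    def read_table(self, table: str, branch: str = "main",
                   at_commit: Optional[int] = None) -> str:
        # Time-travel: read from an explicit commit instead of the branch head.
        commit_id = self.branches[branch] if at_commit is None else at_commit
        return self.commits[commit_id].tables[table]

    def merge(self, source: str, into: str = "main") -> int:
        # Naive merge (an assumption of this sketch): source wins on conflicts.
        src = self.commits[self.branches[source]]
        dst = self.commits[self.branches[into]]
        merged = Commit(len(self.commits), dst.id, {**dst.tables, **src.tables})
        self.commits.append(merged)
        self.branches[into] = merged.id
        return merged.id


catalog = ToyCatalog()
v1_commit = catalog.write_table("main", "orders", "s3://bucket/orders/v1.parquet")

# Debug on an isolated branch without touching main.
catalog.create_branch("debug")
catalog.write_table("debug", "orders", "s3://bucket/orders/v2.parquet")
main_before_merge = catalog.read_table("orders", branch="main")

# Publish the fix, then time-travel back to the earlier state.
catalog.merge("debug", into="main")
main_after_merge = catalog.read_table("orders", branch="main")
replayed = catalog.read_table("orders", at_commit=v1_commit)
```

In this sketch, replaying a pipeline run amounts to reading its inputs at a recorded commit id rather than at the current branch head, which is why pinning the data version alongside code and runtime matters for reproducibility.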

Building a serverless Data Lakehouse from spare parts

R2D2: Reducing Redundancy and Duplication in Data Lakes

The Data Lakehouse: Data Warehousing and More

A Lakehouse Architecture for the Management and Analysis of Heterogeneous Data for Biomedical Research and Mega-biobanks

Reproducible and Portable Big Data Analytics in the Cloud

A Big Data Lake for Multilevel Streaming Analytics

ESCAPE Data Lake

On the Logical Design of a Prototypical Data Lake System for Biological Resources

Benchmarking Data Lakes Featuring Structured and Unstructured Data with DLBench

Bauplan: zero-copy, scale-up FaaS for data pipelines

LakeBench: Benchmarks for Data Discovery over Data Lakes

Deep Lake: a Lakehouse for Deep Learning

Open Reproducible Neuroscience Research on Cloud with Infrastructure as Code

Enhancing Dependability in Big Data Analytics Enterprise Pipelines

Data Lakehouse: Next Generation Information System

Toward data lakes as central building blocks for data management and analysis

Realising Data-Centric Scientific Workflows with Provenance-Capturing on Data Lakes

A Review on Data Lake

Smart caching in a Data Lake for High Energy Physics analysis

BigDataScript: a scripting language for data pipelines