Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie

Jacopo Tagliabue, Ciro Greco
2024-04-21
Abstract: As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.
Databases, Machine Learning
What problem does this paper attempt to address?
This paper addresses the reproducibility of data workloads over data lakes. As the Lakehouse architecture becomes more widespread, ensuring that workloads on the lake can be reproduced has become a key concern for data engineers. Specifically, the paper points out two obstacles:

1. **Complexity of data pipelines and difficulty in debugging**: The size of data pipelines makes testing and iteration slow. At the same time, business logic is intertwined with data management, which complicates debugging and makes errors more likely.
2. **Limitations of existing tools**: Existing tools work independently to some extent, but reproducing a data pipeline requires time travel across several components at once (input data, code, runtime environment, and hardware), which demands a great deal of engineering expertise, setup, and context switching.

To address these problems, the paper introduces a system built on Bauplan and Nessie, which improves the reproducibility of data pipelines in the following ways:

- **Decoupling compute and data management**: A cloud runtime is paired with Nessie, an open-source catalog with Git semantics, so that compute is separated from data management, which simplifies debugging and reduces errors.
- **Providing time-travel and branching semantics**: Time travel and branching are offered on top of object storage, and users can fully reproduce a pipeline with a few CLI commands.
- **Multi-language support and modular code**: Data pipelines can be written in multiple programming languages and run directly in the cloud via the CLI (command-line interface), ensuring runtime compatibility and hardware flexibility.
In summary, the paper aims to solve the reproducibility problem of data workloads in the data lake architecture by introducing a new framework, enabling data scientists and engineers to develop, debug, and maintain data pipelines more efficiently.
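
To make the idea of a catalog with Git semantics more concrete, the following is a minimal in-memory sketch in Python. It is not the Nessie or Bauplan API: the `ToyCatalog` class, its methods, and the `s3://example-bucket/...` paths are all hypothetical names chosen for illustration. The sketch only assumes that a commit maps table names to immutable snapshot pointers, so that a branch is a movable reference to a commit and time travel amounts to resolving an older commit ID.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """An immutable catalog state: table name -> snapshot pointer."""
    id: str
    parent: str | None
    tables: dict


class ToyCatalog:
    """Hypothetical, in-memory stand-in for a catalog with Git semantics."""

    def __init__(self) -> None:
        root = Commit(id="root", parent=None, tables={})
        self.commits = {root.id: root}
        self.branches = {"main": root.id}

    def resolve(self, ref: str) -> Commit:
        # A ref is either a branch name or a commit ID.
        commit_id = self.branches.get(ref, ref)
        return self.commits[commit_id]

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        # Branching copies a pointer, not the data files.
        self.branches[name] = self.resolve(from_ref).id

    def commit(self, branch: str, table: str, snapshot: str) -> str:
        # Record a new snapshot pointer for one table and advance the branch.
        parent = self.resolve(branch)
        tables = {**parent.tables, table: snapshot}
        payload = json.dumps({"parent": parent.id, "tables": tables}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = Commit(commit_id, parent.id, tables)
        self.branches[branch] = commit_id
        return commit_id


# Usage: the paths below are made-up placeholders for files in object storage.
catalog = ToyCatalog()
first = catalog.commit("main", "orders", "s3://example-bucket/orders/v1.parquet")

catalog.create_branch("feature")
catalog.commit("feature", "orders", "s3://example-bucket/orders/v2.parquet")

# Work on the feature branch does not affect main.
main_orders = catalog.resolve("main").tables["orders"]
feature_orders = catalog.resolve("feature").tables["orders"]

# Time travel: resolving a commit ID returns the state at that point.
catalog.commit("main", "orders", "s3://example-bucket/orders/v3.parquet")
orders_at_first = catalog.resolve(first).tables["orders"]
```

In this sketch, re-running a pipeline against the commit ID `first` always reads the same snapshot pointer, regardless of later writes to `main`. Pinning the input data in this way is one of the ingredients of a replayable pipeline; the code version and runtime environment would need to be pinned as well.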