Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

Stefan Grafberger, Paul Groth, Sebastian Schelter
DOI: https://doi.org/10.1145/3650203.3663327
2024-04-30
Abstract: Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision to apply incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.
Databases, Machine Learning, Software Engineering
### What problems does this paper attempt to solve?

This paper addresses problems that arise when developing machine learning (ML) data preparation code. Specifically, it targets three issues:

1. **Tedious manual debugging and improvement of ML pipelines**: Data scientists repeatedly screen, debug, and revise pipeline code to find and fix potential problems. This process is time-consuming, error-prone, and demands a high level of expertise.
2. **Lack of interactive improvement suggestions**: Data scientists currently have no effective tools that provide immediate, interactive improvement suggestions during development. They therefore have to guess at possible problems and verify these guesses through trial and error.
3. **Limitations of existing systems**: Existing systems such as mlinspect, DataScope, and mlwhatif can help detect certain problems, but they usually assume that data scientists already know which types of problems to look for. Their design also does not account for the iterative development cycle, which results in long execution times.

### Proposed solutions

To address these problems, the authors propose "shadow pipelines": hidden variants of the original pipeline that automatically detect potential problems and try out different improvements. The approach consists of three steps:

- **Problem detection**: Introduce operators that screen the original pipeline for potential problems.
- **Root cause analysis**: Locate the specific operators or input tuples that are the root cause of a detected problem.
- **Improvement suggestions**: Generate improvement suggestions, together with source-based explanations and a quantitative estimate of their expected impact.
To ensure low-latency computation, the authors also propose using incremental view maintenance (IVM) to reuse and update intermediate results, which avoids unnecessary recomputation.

### Experimental verification

The authors conducted preliminary experiments that demonstrate the feasibility of shadow pipelines and the performance benefit of the proposed optimisations. The results show that the optimised shadow pipelines reduce the running time by up to a factor of 38, which substantially improves the responsiveness of the interactive user experience.

### Summary

Overall, this paper aims to automate and accelerate the debugging and improvement of ML pipelines by introducing shadow pipelines, which provide immediate, interactive improvement suggestions and thereby improve both the productivity of data scientists and the quality of ML models.
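The incremental view maintenance idea mentioned under the proposed solutions can be illustrated with a generic, minimal sketch. It is not taken from the paper: the class `IncrementalGroupMean`, its methods, and the toy data are assumptions made for this example. The view keeps per-group counts and sums for a filter-then-aggregate step and updates them from a delta of inserted and deleted tuples instead of rescanning the full input.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Tuple2 = Tuple[str, float]  # (group key, value)

class IncrementalGroupMean:
    """Materialised view: mean of value per key over tuples passing a filter."""

    def __init__(self, predicate: Callable[[float], bool]):
        self.predicate = predicate
        self.count: Dict[str, int] = defaultdict(int)
        self.total: Dict[str, float] = defaultdict(float)

    def apply_delta(self, inserted: Iterable[Tuple2] = (),
                    deleted: Iterable[Tuple2] = ()) -> None:
        # Only the changed tuples are touched; existing state is reused.
        for key, value in inserted:
            if self.predicate(value):
                self.count[key] += 1
                self.total[key] += value
        for key, value in deleted:
            if self.predicate(value):
                self.count[key] -= 1
                self.total[key] -= value

    def result(self) -> Dict[str, float]:
        return {k: self.total[k] / c for k, c in self.count.items() if c > 0}

def recompute(rows: List[Tuple2], predicate) -> Dict[str, float]:
    """Full recomputation from scratch, used here only as a reference."""
    view = IncrementalGroupMean(predicate)
    view.apply_delta(inserted=rows)
    return view.result()

def non_negative(value: float) -> bool:
    return value >= 0

base = [("a", 10.0), ("a", 20.0), ("b", 5.0), ("b", -1.0)]
view = IncrementalGroupMean(non_negative)
view.apply_delta(inserted=base)           # initial materialisation

# A variant of the input that differs in only two tuples.
view.apply_delta(inserted=[("b", 15.0)], deleted=[("a", 10.0)])
updated = [("a", 20.0), ("b", 5.0), ("b", -1.0), ("b", 15.0)]
```

After the delta is applied, `view.result()` matches `recompute(updated, non_negative)` while having processed only two tuples. The same principle, applied to the intermediate results of a pipeline, is what makes it plausible to maintain many pipeline variants at low latency.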