Abstract: Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision applying incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses problems that arise when developing data preparation code for machine learning (ML). Specifically, it targets three aspects:

1. **Manual debugging and improvement of ML pipelines is cumbersome.** Data scientists repeatedly screen, debug, and revise pipeline code to find and fix potential problems. This process is time-consuming, error-prone, and demands a high level of expertise.
2. **Interactive improvement suggestions are lacking.** Data scientists currently have no effective tools that provide immediate, interactive improvement suggestions while they develop ML pipelines. They must therefore guess at possible problems and verify these guesses through trial and error.
3. **Existing systems are limited.** Systems such as mlinspect, DataScope, and mlwhatif can help detect certain problems, but they usually assume that data scientists already know which types of problems to look for. Their design also does not account for the iterative development cycle, which results in long execution times.

### Proposed solutions

To solve these problems, the authors propose "shadow pipelines": hidden variants of the original pipeline that automatically detect potential problems and try out different improvements. The approach has three steps:

- **Problem detection**: Introduce operators that screen the original pipeline for potential problems.
- **Root cause analysis**: Locate the specific operators or input tuples that cause a problem.
- **Improvement suggestions**: Generate improvement suggestions, together with source-based explanations and a quantitative evaluation of the expected impact.
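The three steps above can be illustrated with a small sketch. This is not the paper's implementation: the operators, the row-loss check, and the `screen` and `suggest` helpers are hypothetical names chosen for illustration, and the pipeline is reduced to a list of functions over plain Python dictionaries.

```python
from statistics import mean

# Hypothetical toy operators; a real pipeline would use dataframe code.
def drop_missing_age(rows):
    """Original step: silently drops rows whose 'age' is missing."""
    return [r for r in rows if r["age"] is not None]

def impute_mean_age(rows):
    """Candidate modification: fill missing 'age' with the mean of known ages."""
    known = [r["age"] for r in rows if r["age"] is not None]
    fill = mean(known) if known else 0
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in rows]

def keep_adults(rows):
    return [r for r in rows if r["age"] >= 18]

def run(pipeline, rows):
    """Execute a pipeline given as a list of (name, operator) pairs."""
    for _name, op in pipeline:
        rows = op(rows)
    return rows

def screen(pipeline, rows, max_loss=0.2):
    """Problem detection and root cause: flag operators that drop many rows."""
    flagged = []
    for name, op in pipeline:
        out = op(rows)
        loss = 1 - len(out) / len(rows) if rows else 0.0
        if loss > max_loss:
            flagged.append((name, loss))
        rows = out
    return flagged

def suggest(pipeline, rows, alternatives):
    """Improvement suggestions: run a shadow variant per flagged operator."""
    baseline = len(run(pipeline, rows))
    suggestions = []
    for name, loss in screen(pipeline, rows):
        if name not in alternatives:
            continue
        alt_name, alt_op = alternatives[name]
        # The shadow pipeline swaps the flagged operator for its alternative.
        shadow = [(alt_name, alt_op) if n == name else (n, op)
                  for n, op in pipeline]
        suggestions.append({
            "operator": name,
            "replace_with": alt_name,
            "rows_before": baseline,
            "rows_after": len(run(shadow, rows)),
            "explanation": f"'{name}' drops {loss:.0%} of its input rows",
        })
    return suggestions

rows = [{"age": 30}, {"age": None}, {"age": 50}, {"age": None}, {"age": 40}]
pipeline = [("drop_missing_age", drop_missing_age), ("keep_adults", keep_adults)]
alternatives = {"drop_missing_age": ("impute_mean_age", impute_mean_age)}

suggestions = suggest(pipeline, rows, alternatives)
for s in suggestions:
    print(s)
```

In this toy setting the quantitative impact is simply the number of rows that survive the pipeline. A real system would more plausibly report the effect on a model quality metric, which the sketch leaves out.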
To ensure low-latency computation, the authors also propose using incremental view maintenance (IVM) to reuse and update intermediate results, which avoids unnecessary recomputation.

### Experimental verification

The authors conducted preliminary experiments that demonstrate the feasibility of shadow pipelines and the performance gains from the proposed optimisations. The results show that optimised shadow pipelines reduce runtime by a factor of up to 38, which substantially improves the responsiveness of the interactive user experience.

### Summary

In general, this paper aims to automate and accelerate the debugging and improvement of ML pipelines. It introduces shadow pipelines to provide immediate, interactive improvement suggestions, with the goal of improving both the productivity of data scientists and the quality of ML models.
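The incremental view maintenance idea mentioned above can be sketched on a single aggregate. The example is a minimal illustration under stated assumptions, not the paper's optimisation: `IncrementalGroupMean` and `recompute` are hypothetical names, the "view" is a per-group mean, and deletions are assumed to refer only to rows that were previously inserted.

```python
from collections import defaultdict

def recompute(rows):
    """Baseline: compute mean(value) per key from scratch."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {k: sum(v) / len(v) for k, v in groups.items()}

class IncrementalGroupMean:
    """Materialised view of mean(value) per key, maintained under deltas."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def apply_delta(self, inserted=(), deleted=()):
        # Only the changed rows are touched; untouched groups are reused.
        for key, value in inserted:
            self._count[key] += 1
            self._total[key] += value
        for key, value in deleted:
            self._count[key] -= 1
            self._total[key] -= value
            if self._count[key] == 0:
                del self._count[key]
                del self._total[key]

    def view(self):
        return {k: self._total[k] / self._count[k] for k in self._count}

# Result of the original pipeline, computed once.
base_rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
view = IncrementalGroupMean()
view.apply_delta(inserted=base_rows)
original_result = view.view()

# A shadow variant that filters out one row only needs to apply that delta.
removed = [("a", 3.0)]
view.apply_delta(deleted=removed)
shadow_result = view.view()

# The incrementally maintained result matches a full recomputation.
shadow_rows = [r for r in base_rows if r not in removed]
print(original_result, shadow_result, recompute(shadow_rows))
```

The design point is that the cost of updating the view depends on the size of the change rather than the size of the input, which is why this style of maintenance suits pipeline variants that differ from the original in only a few rows or operators.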

Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"
