AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

Lan Li, Liri Fang, Vetle I. Torvik
2024-12-10
Abstract: We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), this pipeline generates a minimal, clean table sufficient to address the purpose and the data cleaning workflow used to produce the table. The planning process involves three main LLM-driven components: (1) Select Target Columns: Identifies a set of target columns related to the purpose. (2) Inspect Column Quality: Assesses the data quality for each target column and generates a Data Quality Report as operation objectives. (3) Generate Operation & Arguments: Predicts the next operation and arguments based on the data quality report results. Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels. The benchmark comprises the annotated datasets as a collection of purpose, raw table, clean table, data cleaning workflow, and answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows. The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.
Databases, Computation and Language
What problem does this paper attempt to address?
This paper addresses two problems: automatically generating data-cleaning workflows, and evaluating how well large language models (LLMs) perform at this task. Specifically, the authors propose AutoDCWorkflow, an LLM-based framework that generates a sequence of data-cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. The paper also proposes a new benchmark for evaluating the ability of LLMs to generate data-cleaning workflows automatically.

### Main contributions of the paper:

1. **Propose a three-stage pipeline based on LLMs**: Select the target columns, check the column quality, and generate operations and parameters. The pipeline takes the raw table and the purpose of the data analysis as input, and outputs the data-cleaning workflow and the cleaned table.
2. **Propose a new dataset benchmark**: Used to evaluate the generated data-cleaning workflows along three dimensions: the purpose answer, the column values, and the workflow (operations).
3. **Evaluate the reasoning abilities of multiple LLMs**: Llama 3.1, Mistral, and Gemma 2. These models perform well in generating high-quality workflows, especially Llama 3.1.

### Specific problems and solutions in the paper:

- **Problem**: Data cleaning is labor-intensive and error-prone, and the appropriate operations and parameters must be selected according to the specific analysis purpose.
- **Solutions**:
  - **Select target columns**: Identify the columns related to the analysis purpose.
  - **Check column quality**: Evaluate the data quality of each target column and generate a data quality report that serves as the operation objective.
  - **Generate operations and parameters**: Predict the next operation and its parameters from the results of the data quality report.

### Datasets and evaluation methods:

- **Datasets**: The paper uses four real-world datasets from different open platforms, including menu data, Chicago food inspection data, Paycheck Protection Program loan data, and dish data.
- **Evaluation dimensions**:
  - **Purpose answer**: Whether the repaired table yields an answer that is the same as, or close to, the answer obtained from the manually cleaned table.
  - **Column values**: The similarity between the repaired table and the manually cleaned table.
  - **Workflow (operations)**: Whether the operations generated by AutoDCWorkflow are correct and complete.

Through these methods, the paper systematically evaluates the performance of LLMs on data-cleaning tasks and provides a reference point for future research and applications.
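
To make the three-stage loop summarized above more concrete, here is a minimal Python sketch of its control flow. It is not the authors' implementation: each LLM-driven stage is replaced by a simple hand-written rule, and the function names, the operation names (`trim`, `upper`, `fill_missing`, `drop_duplicates`), the `UNKNOWN` placeholder, and the sample table are all hypothetical, chosen only for illustration.

```python
import re
from typing import Optional

Row = dict[str, Optional[str]]

def select_target_columns(table: list[Row], purpose: str) -> list[str]:
    """Stage 1 stand-in: keep columns whose name appears as a word in the purpose."""
    words = set(re.findall(r"[a-z0-9_]+", purpose.lower()))
    return [col for col in table[0] if col.lower() in words]

def inspect_column_quality(table: list[Row], col: str) -> dict[str, bool]:
    """Stage 2 stand-in: a tiny data quality report for one column."""
    values = [row[col] for row in table]
    present = [v for v in values if v]  # drops None and empty strings
    return {
        "missing": len(present) < len(values),
        "untrimmed": any(v != v.strip() for v in present),
        # Assumption of this sketch: upper case is the canonical format.
        "mixed_case": any(v != v.upper() for v in present),
    }

def next_operation(report: dict[str, bool]) -> Optional[str]:
    """Stage 3 stand-in: choose the next operation, or None if the column is clean."""
    if report["untrimmed"]:
        return "trim"
    if report["mixed_case"]:
        return "upper"
    if report["missing"]:
        return "fill_missing"
    return None

def apply_operation(table: list[Row], col: str, op: str) -> None:
    """Apply one (hypothetical) cleaning operation to a column in place."""
    for row in table:
        value = row[col]
        if op == "trim" and value:
            row[col] = value.strip()
        elif op == "upper" and value:
            row[col] = value.upper()
        elif op == "fill_missing" and not value:
            row[col] = "UNKNOWN"

def run_pipeline(dirty: list[Row], purpose: str, max_steps: int = 10):
    table = [dict(row) for row in dirty]  # leave the caller's table untouched
    targets = select_target_columns(table, purpose)
    workflow: list[tuple[str, str]] = []
    for col in targets:
        # Inspect -> pick operation -> apply, until the report is clean.
        for _ in range(max_steps):
            op = next_operation(inspect_column_quality(table, col))
            if op is None:
                break
            apply_operation(table, col, op)
            workflow.append((op, col))
    # Project onto the target columns. Assumption of this sketch: rows that
    # are identical on the target columns count as duplicates.
    clean, seen = [], set()
    for row in table:
        key = tuple(row[c] for c in targets)
        if key not in seen:
            seen.add(key)
            clean.append(dict(zip(targets, key)))
    if len(clean) < len(table):
        workflow.append(("drop_duplicates", ",".join(targets)))
    return clean, workflow

dirty = [
    {"id": "1", "city": " chicago", "note": "a"},
    {"id": "2", "city": "Chicago ", "note": "b"},
    {"id": "3", "city": None, "note": "c"},
    {"id": "4", "city": "Evanston", "note": "d"},
]
purpose = "How many distinct values does the city column contain?"
clean, workflow = run_pipeline(dirty, purpose)
print(clean)
print(workflow)
```

The sketch returns both artifacts the paper describes as pipeline outputs: a minimal table restricted to the purpose-relevant columns, and the ordered list of operations that produced it. In the actual framework each of the three stand-in functions would be an LLM prompt rather than a fixed rule.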