Abstract: We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), this pipeline generates a minimal, clean table sufficient to address the purpose and the data cleaning workflow used to produce the table. The planning process involves three main LLM-driven components: (1) Select Target Columns: Identifies a set of target columns related to the purpose. (2) Inspect Column Quality: Assesses the data quality for each target column and generates a Data Quality Report as operation objectives. (3) Generate Operation & Arguments: Predicts the next operation and arguments based on the data quality report results. Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels. The benchmark comprises the annotated datasets as a collection of purpose, raw table, clean table, data cleaning workflow, and answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows. The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.

What problem does this paper attempt to address?

This paper addresses the problem of automatically generating data-cleaning workflows and evaluates how well large language models (LLMs) perform at this task. Specifically, the authors propose AutoDCWorkflow, an LLM-based framework that generates a sequence of data-cleaning operations to repair three kinds of data quality issues: duplicate data, missing values, and inconsistent data formats. The paper also proposes a new benchmark for evaluating the ability of LLMs to generate data-cleaning workflows automatically.

### Main contributions of the paper:

1. **Propose a three-stage pipeline based on LLMs**: Select the target columns, inspect the column quality, and generate operations and arguments. The pipeline takes the raw table and the purpose of the data analysis as input and outputs the data-cleaning workflow and the cleaned table.
2. **Propose a new dataset benchmark**: Used to evaluate the generated data-cleaning workflows along multiple dimensions: the purpose answer, the column values, and the workflow (operations).
3. **Evaluate the reasoning abilities of multiple LLMs**: Llama 3.1, Mistral, and Gemma 2. These models perform well in generating high-quality workflows, especially Llama 3.1.

### Specific problems and solutions in the paper:

- **Problem**: Data cleaning is a labor-intensive and error-prone task, and appropriate data operations and arguments must be selected according to the specific analysis purpose.
- **Solutions**:
  - **Select target columns**: Identify columns related to the analysis purpose.
  - **Inspect column quality**: Assess the data quality of each target column and generate a data quality report that serves as the operation objective.
  - **Generate operations and arguments**: Predict the next operation and its arguments based on the results of the data quality report.

### Datasets and evaluation methods:

- **Datasets**: The paper uses four real-world datasets from different open platforms: menu data, Chicago food inspection data, Paycheck Protection Program loan data, and dish data.
- **Evaluation dimensions**:
  - **Purpose answer**: Whether the repaired table yields an answer that is the same as, or close to, the answer obtained from the manually cleaned table.
  - **Column values**: The similarity between the repaired table and the manually cleaned table.
  - **Workflow (operations)**: Whether the operations generated by AutoDCWorkflow are correct and complete.

Through these methods, the paper systematically evaluates the performance of LLMs on data-cleaning tasks and provides a reference point for future research and applications.
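The control flow of the three-stage pipeline summarized above can be illustrated with a small sketch. This is not the authors' implementation: every function name, operation name, and the sample table below are hypothetical, and each LLM-driven stage is replaced by a simple rule-based stand-in so that the example runs without a model. It only shows how the stages might chain together: select columns, then repeatedly inspect and repair each column until its quality report is empty.

```python
import re
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
Table = List[Row]
Operation = Tuple[str, str]  # (operation name, column); names are illustrative


def select_target_columns(table: Table, purpose: str) -> List[str]:
    """Stage 1 stand-in: keep columns whose name appears as a word in the purpose."""
    words = set(re.findall(r"\w+", purpose.lower()))
    columns = list(table[0].keys()) if table else []
    return [c for c in columns if c.lower() in words]


def is_missing(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def inspect_column_quality(table: Table, column: str) -> List[str]:
    """Stage 2 stand-in: a tiny 'data quality report' listing issue labels."""
    values = [row[column] for row in table]
    issues = []
    if any(is_missing(v) for v in values):
        issues.append("missing_values")
    present = [v for v in values if not is_missing(v)]
    if any(v != v.strip().lower() for v in present):
        issues.append("inconsistent_format")
    return issues


def generate_operation(column: str, issues: List[str]) -> Optional[Operation]:
    """Stage 3 stand-in: pick the next operation from the report, or None if clean."""
    if "inconsistent_format" in issues:
        return ("normalize_text", column)
    if "missing_values" in issues:
        return ("fill_missing", column)
    return None


def apply_operation(table: Table, op: Operation) -> Table:
    """Apply one operation to a copy of the table."""
    name, column = op
    out = []
    for row in table:
        new_row = dict(row)
        value = new_row[column]
        if name == "normalize_text" and value is not None:
            new_row[column] = value.strip().lower()
        elif name == "fill_missing" and is_missing(value):
            new_row[column] = "unknown"
        out.append(new_row)
    return out


def run_pipeline(table: Table, purpose: str, max_steps: int = 10):
    """Return (minimal clean table, workflow) for the given purpose."""
    targets = select_target_columns(table, purpose)
    workflow: List[Operation] = []
    for column in targets:
        for _ in range(max_steps):  # guard against a loop that never converges
            op = generate_operation(column, inspect_column_quality(table, column))
            if op is None:
                break
            table = apply_operation(table, op)
            workflow.append(op)
    # Assumption: duplicates are removed last, on the target columns only.
    seen, clean = set(), []
    for row in table:
        key = tuple(row[c] for c in targets)
        if key not in seen:
            seen.add(key)
            clean.append({c: row[c] for c in targets})
    if len(clean) < len(table):
        workflow.append(("drop_duplicates", ",".join(targets)))
    return clean, workflow


dirty = [
    {"id": "1", "city": " Chicago ", "risk": "High"},
    {"id": "2", "city": "chicago", "risk": None},
    {"id": "3", "city": "CHICAGO", "risk": "high"},
]
clean, workflow = run_pipeline(dirty, "How many inspections per city and risk level?")
print(clean)
print(workflow)
```

In an LLM-backed version, the three stand-in functions would be replaced by prompts to the model, while the outer loop and the recorded `workflow` list would stay the same. Returning the workflow alongside the table mirrors the summary's point that both the cleaned table and the operations that produced it are outputs.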

AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

IterClean: an Iterative Data Cleaning Framework with Large Language Models

AutoFlow: Automated Workflow Generation for Large Language Model Agents

Data Cleaning Using Large Language Models

DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection

Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse

Testing the use of a large language model (LLM) for performing data quality assessment

LLM4Workflow: An LLM-based Automated Workflow Model Generation Tool

LLM-Assisted Code Cleaning For Training Accurate Code Generators

Making LLMs Work for Enterprise Data Tasks

LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs

Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

Under the Surface: Tracking the Artifactuality of LLM-Generated Data

A Hybrid Data Cleaning Framework Using Markov Logic Networks

Benchmarking Agentic Workflow Generation

TaskBench: Benchmarking Large Language Models for Task Automation

UniDM: A Unified Framework for Data Manipulation with Large Language Models

Automatic Data Transformation Using Large Language Model: An Experimental Study on Building Energy Data

Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation