Abstract: The quality of datasets plays an increasingly crucial role in the research and development of modern artificial intelligence (AI). Despite the proliferation of open dataset platforms, data quality issues such as insufficient documentation, inaccurate annotations, and ethical concerns remain common in datasets widely used in AI. Furthermore, these issues are often subtle and difficult to detect with rule-based scripts, so dataset users or maintainers must identify and verify them manually at considerable expense. With the increasing capability of large language models (LLMs), LLM agents offer a promising way to streamline dataset curation. In this work, as an initial step towards this goal, we propose a dataset curation agent benchmark, DCA-Bench, to measure LLM agents' capability of detecting hidden dataset quality issues. Specifically, we collect diverse real-world dataset quality issues from eight open dataset platforms as a testbed. Additionally, to establish an automatic pipeline for evaluating the success of LLM agents, which requires a nuanced understanding of the agent outputs, we implement a dedicated Evaluator using another LLM agent. We demonstrate that the LLM-based Evaluator empirically aligns well with human evaluation, allowing reliable automatic evaluation on the proposed benchmark. We further conduct experiments with several baseline LLM agents on the proposed benchmark and demonstrate the complexity of the task, indicating that applying LLMs to real-world dataset curation still requires further in-depth exploration and innovation. Finally, the proposed benchmark can also serve as a testbed for measuring the capability of LLMs in problem discovery rather than just problem-solving. The benchmark suite is available at https://github.com/TRAIS-Lab/dca-bench.

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

AIBench: an Industry Standard AI Benchmark Suite from Internet Services

AIBench: Towards Scalable and Comprehensive Datacenter AI Benchmarking

AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite

AIBench Scenario: Scenario-distilling AI Benchmarking

DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Benchmarking Data Science Agents

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

AgentBench: Evaluating LLMs as Agents

Data Interpreter: An LLM Agent For Data Science

DCA-Bench: A Benchmark for Dataset Curation Agents

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

PyBench: Evaluating LLM Agent on various real-world coding tasks

FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models

BLADE: Benchmarking Language Model Agents for Data-Driven Science

CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

Agent-as-a-Judge: Evaluate Agents with Agents