Abstract: The quality of datasets plays an increasingly crucial role in the research and development of modern artificial intelligence (AI). Despite the proliferation of open dataset platforms, data quality issues such as insufficient documentation, inaccurate annotations, and ethical concerns remain common in datasets widely used in AI. Furthermore, these issues are often subtle and difficult to detect with rule-based scripts, requiring expensive manual identification and verification by dataset users or maintainers. With the increasing capability of large language models (LLMs), it is promising to streamline the curation of datasets with LLM agents. In this work, as the initial step towards this goal, we propose a dataset curation agent benchmark, DCA-Bench, to measure LLM agents' capability of detecting hidden dataset quality issues. Specifically, we collect diverse real-world dataset quality issues from eight open dataset platforms as a testbed. Additionally, to establish an automatic pipeline for evaluating the success of LLM agents, which requires a nuanced understanding of the agent outputs, we implement a dedicated Evaluator using another LLM agent. We demonstrate that the LLM-based Evaluator empirically aligns well with human evaluation, allowing reliable automatic evaluation on the proposed benchmark. We further conduct experiments on several baseline LLM agents on the proposed benchmark and demonstrate the complexity of the task, indicating that applying LLMs to real-world dataset curation still requires further in-depth exploration and innovation. Finally, the proposed benchmark can also serve as a testbed for measuring the capability of LLMs in problem discovery rather than just problem-solving. The benchmark suite is available at https://github.com/TRAIS-Lab/dca-bench.

What problem does this paper attempt to address?

The paper addresses the following problem. High-quality datasets are increasingly important in modern artificial intelligence (AI) research and development. Although many open dataset platforms now exist, the datasets they host still commonly suffer from quality problems such as insufficient documentation, inaccurate labeling, and ethical issues. These problems are usually hidden and hard to detect with rule-based scripts, so dataset users or maintainers must identify and verify them manually at considerable cost. Existing dataset management tools and techniques rely mainly on rule-based scripts, lack flexibility, and cannot effectively discover such hidden problems. Given the steadily improving capabilities of large language models (LLMs), the paper proposes using LLM agents to simplify dataset curation. To that end, it introduces a dataset curation agent benchmark named DCA-Bench, which measures the ability of LLM agents to discover hidden data quality problems. Specifically, the authors collected diverse real-world data quality problems from eight open dataset platforms as a testbed. Because judging whether an agent succeeded requires a nuanced understanding of its output, they also implemented a dedicated Evaluator, itself an LLM agent, to build an automatic evaluation pipeline. The experimental results show that the LLM-based Evaluator aligns well with human evaluation and therefore supports reliable automatic evaluation. The paper also reports the performance of several baseline LLM agents on the benchmark, which demonstrates the complexity of the task and indicates that applying LLMs to real-world dataset curation still requires further in-depth exploration and innovation.
In summary, the main goal of the paper is to promote the development of LLM agents that can autonomously discover data quality problems by proposing the DCA-Bench benchmark, thereby improving the quality of open datasets contributed by the community.
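The claim that the LLM-based Evaluator aligns with human evaluation can be made concrete with a small sketch. The code below is illustrative only: it assumes each test case receives a binary success/failure verdict from both a human annotator and the automatic Evaluator, and it computes raw agreement and Cohen's kappa over those verdicts. The function names and the sample labels are hypothetical, and the summary above does not state which agreement statistics the paper actually reports.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[bool], auto: Sequence[bool]) -> float:
    """Fraction of cases where the automatic verdict matches the human one."""
    if len(human) != len(auto) or len(human) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == a for h, a in zip(human, auto)) / len(human)


def cohen_kappa(human: Sequence[bool], auto: Sequence[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement_rate(human, auto)
    human_counts = Counter(human)
    auto_counts = Counter(auto)
    # Chance agreement: both raters independently pick the same class.
    p_expected = sum(
        human_counts[c] * auto_counts[c] for c in (True, False)
    ) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical class; kappa is undefined,
        # so report perfect agreement by convention in this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical verdicts for eight test cases (True = agent found the issue).
human_labels = [True, True, False, False, True, False, False, False]
auto_labels = [True, True, False, False, False, False, False, True]

if __name__ == "__main__":
    print("agreement:", agreement_rate(human_labels, auto_labels))
    print("kappa:", cohen_kappa(human_labels, auto_labels))
```

Raw agreement is easy to read but can look high when one verdict dominates, which is why a chance-corrected statistic such as kappa is often reported alongside it.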

DCA-Bench: A Benchmark for Dataset Curation Agents

AIBench: Towards Scalable and Comprehensive Datacenter AI Benchmarking

AIBench: An Industry Standard AI Benchmark Suite from Internet Services

Benchmarking Data Science Agents

AIBench: An Industry Standard AI Benchmark Suite

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

AgentBench: Evaluating LLMs as Agents

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

CurBench: Curriculum Learning Benchmark

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

PyBench: Evaluating LLM Agent on various real-world coding tasks

Evaluating Cultural and Social Awareness of LLM Web Agents

3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset

CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

AI Competitions and Benchmarks: Dataset Development