Abstract: The quality of datasets plays an increasingly crucial role in the research and development of modern artificial intelligence (AI). Despite the proliferation of open dataset platforms, data quality issues such as insufficient documentation, inaccurate annotations, and ethical concerns remain common in datasets widely used in AI. Furthermore, these issues are often subtle and difficult to detect with rule-based scripts, so they require expensive manual identification and verification by dataset users or maintainers. With the increasing capability of large language models (LLMs), streamlining dataset curation with LLM agents is a promising direction. In this work, as an initial step towards this goal, we propose a dataset curation agent benchmark, DCA-Bench, to measure LLM agents' capability of detecting hidden dataset quality issues. Specifically, we collect diverse real-world dataset quality issues from eight open dataset platforms as a testbed. Additionally, because judging whether an LLM agent has succeeded requires a nuanced understanding of its outputs, we implement a dedicated Evaluator using another LLM agent to establish an automatic evaluation pipeline. We demonstrate that the LLM-based Evaluator empirically aligns well with human evaluation, allowing reliable automatic evaluation on the proposed benchmark. We further conduct experiments with several baseline LLM agents on the proposed benchmark and demonstrate the complexity of the task, indicating that applying LLMs to real-world dataset curation still requires further in-depth exploration and innovation. Finally, the proposed benchmark can also serve as a testbed for measuring the capability of LLMs in problem discovery rather than just problem-solving. The benchmark suite is available at \url{https://github.com/TRAIS-Lab/dca-bench}.

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

AIBench: An Industry Standard AI Benchmark Suite

AIBench: An Industry Standard AI Benchmark Suite from Internet Services

AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite

AIBench: Towards Scalable and Comprehensive Datacenter AI Benchmarking

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions

Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

DCA-Bench: A Benchmark for Dataset Curation Agents

AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models

Benchmarking Foundation Models with Language-Model-as-an-Examiner

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

LiveBench: A Challenging, Contamination-Free LLM Benchmark

TaskBench: Benchmarking Large Language Models for Task Automation

CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines