Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia
Tzu-Sheng Kuo, Aaron Halfaker, Zirui Cheng, Jiwoo Kim, Meng-Hsin Wu, Tongshuang Wu, Kenneth Holstein, Haiyi Zhu
DOI: https://doi.org/10.1145/3613904.3642278
2024-02-22
Abstract: AI tools are increasingly deployed in community contexts. However, datasets
used to evaluate AI are typically created by developers and annotators outside
a given community, which can yield misleading conclusions about AI performance.
How might we empower communities to drive the intentional design and curation
of evaluation datasets for AI that impacts them? We investigate this question
on Wikipedia, an online community with multiple AI-based content moderation
tools deployed. We introduce Wikibench, a system that enables communities to
collaboratively curate AI evaluation datasets, while navigating ambiguities and
differences in perspective through discussion. A field study on Wikipedia shows
that datasets curated using Wikibench can effectively capture community
consensus, disagreement, and uncertainty. Furthermore, study participants used
Wikibench to shape the overall data curation process, including refining label
definitions, determining data inclusion criteria, and authoring data
statements. Based on our findings, we propose future directions for systems
that support community-driven data curation.
Artificial Intelligence, Human-Computer Interaction