Collection of User Judgments on Spoken Dialog System with Crowdsourcing

Zhaojun Yang,Baichuan Li,Yi Zhu,Irwin King,Gina-Anne Levow,Helen M. Meng
DOI: https://doi.org/10.1109/slt.2010.5700864
2010-01-01
Abstract:This paper presents an initial attempt at the use of crowd-sourcing for collection of user judgments on spoken dialog systems (SDSs). This is implemented on Amazon Mechanical Turk (MTurk), where a Requester can design a human intelligence task (HIT) to be performed by a large number of Workers efficiently and cost-effectively. We describe a design methodology for two types of HITs - the first targets at fast rating of a large number of dialogs regarding some dimensions of the SDS's performance and the second aims to assess the reliability of Workers on MTurk through the variability in ratings across different Workers. A set of approval rules are also designed to control the quality of ratings from MTurk. At the end of the collection work, user judgments for about 8,000 dialogs rated by around 700Workers are collected in 45 days. We observe reasonable consistency between the manual MTurk ratings and an automatic categorization of dialogs in terms of task completion, which partially verifies the reliability of the approved ratings from MTurk. From the second type of HITs, we also observe moderate inter-rater agreement for ratings in task completion which provides support for the utilization of MTurk as a judgments collection platform. Further research on the exploration of SDS evaluation models could be developed based on the collected corpus.
What problem does this paper attempt to address?