Abstract: Background: One in five U.S. adults lives with a mental health condition, and 4.6% of all U.S. adults have a serious mental illness. The Internet has become the first place these people turn to for mental health information. However, online mental health information is not well organized and is often of low quality. There have been efforts to build evidence-based mental health knowledgebases curated with information manually extracted from high-quality scientific literature, but manual extraction is inefficient. Crowdsourcing can potentially be a low-cost mechanism to collect labeled data from non-expert laypeople. However, no existing annotation tool is integrated with popular crowdsourcing platforms to perform such information extraction tasks. In our previous work, we prototyped a Semantic Text Annotation Tool (STAT) to address this gap. Objective: We aimed to refine the STAT prototype (1) to improve its usability and (2) to enhance the efficiency of the crowdsourcing workflow, in order to facilitate the construction of an evidence-based mental health knowledgebase, following a user-centered design (UCD) approach. Methods: Following UCD principles, we conducted four design iterations to improve the initial STAT prototype. In the first two iterations, usability testing focus groups were conducted internally with 8 participants recruited from a convenience sample, and usability was evaluated with a modified System Usability Scale (SUS). In the following two iterations, usability testing was conducted externally using the Amazon Mechanical Turk (MTurk) platform. In each iteration, we summarized the usability testing results through thematic analysis, identified usability issues, and conducted a heuristic evaluation to map the identified issues to Jakob Nielsen's usability heuristics. We collected suggested improvements in the usability testing sessions and enhanced STAT accordingly in the next UCD iteration.
After four UCD iterations, we conducted a case study of the system on MTurk using mental health related scientific literature. We compared the performance of crowdsourcing workers with that of two expert annotators on two aspects: efficiency and quality. Results: The SUS score increased from 70.3 +/- 12.5 to 81.1 +/- 9.8 after the two internal UCD iterations as we improved STAT's functionality based on the suggested improvements. We then evaluated STAT externally through MTurk in the following two iterations. The SUS score decreased to 55.7 +/- 20.1 in the third iteration, probably because of the complexity of the tasks. After further simplification of STAT and the annotation tasks with an improved annotation guideline, the SUS score increased to 73.8 +/- 13.8 in the fourth iteration of UCD. In the evaluation case study, the workers spent 125.5 +/- 69.2 s on average on the onboarding tutorial, and they spent significantly less time on the annotation tasks than the two experts. In terms of annotation quality, the workers' annotation results achieved average F1-scores ranging from 0.62 to 0.84 across the different sentences. Conclusions: We successfully developed a web-based semantic text annotation tool, STAT, to facilitate the curation of semantic web knowledgebases through four UCD iterations. The lessons learned from the UCD process could guide further enhancement of STAT and the design and development of other crowdsourcing-based semantic text annotation tasks. Our study also showed that a well-organized, informative annotation guideline is as important as the annotation tool itself. Further, we learned that a crowdsourcing task should consist of multiple simple microtasks rather than one complicated task.
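
The abstract reports annotation quality as F1-scores of worker annotations against expert annotations, but it does not say how annotations were matched. As a minimal sketch only, assuming exact-match comparison of annotation items (the function name `annotation_f1`, the triple format, and the example data are all hypothetical and not taken from the study), a per-sentence F1 could be computed as follows:

```python
from typing import Hashable, Iterable


def annotation_f1(worker: Iterable[Hashable], expert: Iterable[Hashable]) -> float:
    """F1 of a worker's annotations against an expert reference set.

    Assumption: two annotations count as a match only if they are identical.
    """
    worker_set, expert_set = set(worker), set(expert)
    if not worker_set and not expert_set:
        # Both empty: treated as perfect agreement (a convention assumed here).
        return 1.0
    true_positives = len(worker_set & expert_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(worker_set)
    recall = true_positives / len(expert_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one sentence, as (subject, predicate, object) triples.
expert = {("exercise", "reduces", "anxiety"),
          ("insomnia", "associated_with", "depression")}
worker = {("exercise", "reduces", "anxiety"),
          ("exercise", "treats", "depression")}

score = annotation_f1(worker, expert)
print(round(score, 2))
```

Averaging such per-worker scores for each sentence would yield one average F1 per sentence, which is the kind of figure the abstract reports. A partial-match or token-level criterion would give different values.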

Collection of User Judgments on Spoken Dialog System with Crowdsourcing

Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs

CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems

Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems

DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit

Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Towards Better Understanding of User Satisfaction in Open-Domain Conversational Search

Understanding User Satisfaction with Task-oriented Dialogue Systems

On Crowdsourcing-design with Comparison Category Rating for Evaluating Speech Enhancement Algorithms

User-centered Design of a Web-Based Crowdsourcing-Integrated Semantic Text Annotation Tool for Building a Mental Health Knowledge Base

Annotator Rationales for Labeling Tasks in Crowdsourcing

A Survey of NLP-Related Crowdsourcing HITs: what works and what does not

Crowdsourcing in the Absence of Ground Truth -- A Case Study

CDAS: A Crowdsourcing Data Analytics System

FFAEval: Evaluating Dialogue System Via Free-For-All Ranking

LEGOEval: An Open-Source Toolkit for Dialogue System Evaluation via Crowdsourcing

Towards Best Experiment Design for Evaluating Dialogue System Output

Towards speech quality assessment using a crowdsourcing approach: evaluation of standardized methods

An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue Systems

Speech Sentiment and Customer Satisfaction Estimation in Socialbot Conversations