PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data

Zheng Zhang, Zheng Ning, Chenliang Xu, Yapeng Tian, Toby Jia-Jun Li
DOI: https://doi.org/10.1145/3586183.3606776
2023-07-28
Abstract: Audio-visual learning seeks to enhance the computer's multi-modal perception by leveraging the correlation between the auditory and visual modalities. Although audio-visual models support many useful downstream tasks, such as video retrieval, AR/VR, and accessibility, their performance and adoption have been impeded by the limited availability of high-quality datasets. Annotating audio-visual datasets is laborious, expensive, and time-consuming. To address this challenge, we designed and developed an efficient audio-visual annotation tool called Peanut. Peanut's human-AI collaborative pipeline separates the multi-modal task into two single-modal tasks, and utilizes state-of-the-art object detection and sound-tagging models to reduce both the effort annotators spend on each frame and the number of frames that must be annotated manually. A within-subject user study with 20 participants found that Peanut can significantly accelerate the audio-visual data annotation process while maintaining high annotation accuracy.
Human-Computer Interaction