M^3AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang
DOI: https://doi.org/10.18653/v1/2024.acl-long.489
2024-01-01
Abstract: Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the texts and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M^3AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics. With high-quality human annotations of the spoken and written words, in particular high-valued named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M^3AV makes it a challenging dataset.