MultiVENT: Multilingual Videos of Events with Aligned Natural Text

Kate Sanders, David Etter, Reno Kriz, Benjamin Van Durme
2023-07-07
Abstract: Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
Information Retrieval, Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the limitations of existing news video datasets, particularly their focus on traditional English news broadcasts while neglecting multilingual and multimodal online news video content. Specifically, the paper raises the following key issues:

1. **Insufficient Multilingual Coverage**: Existing news video datasets primarily contain English content and lack coverage of organically and naturally generated content in other languages.
2. **Lack of Content Diversity**: Current datasets mainly focus on traditional news broadcasts and fail to include non-professional first-hand video footage, which is becoming increasingly important in modern news reporting.
3. **Information Bias**: Due to the "translationese" problem in the translation process, existing multilingual datasets may contain unnatural content biases.

To address these issues, the paper constructs a dataset named **MultiVENT**, which includes multilingual, event-centric videos and their corresponding natural text descriptions. MultiVENT comprises news broadcast videos and non-professional event videos, covering 5 target languages (Arabic, Chinese, English, Korean, and Russian), aiming to provide a more comprehensive and multi-perspective sample of event reporting.

Through MultiVENT, the authors hope to explore and analyze the diversity and characteristics of online news videos and provide a baseline model (MultiCLIP) for complex multilingual video retrieval tasks to support the development of information retrieval systems.