MultiVENT: Multilingual Videos of Events with Aligned Natural Text

Kate Sanders, David Etter, Reno Kriz, Benjamin Van Durme

2023-07-07

Abstract: Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.

Information Retrieval, Computer Vision and Pattern Recognition, Multimedia

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the limitations of existing news video datasets, particularly their focus on traditional English news broadcasts while neglecting multilingual and multimodal online news video content. Specifically, the paper raises the following key issues:

1. **Insufficient Multilingual Coverage**: Existing news video datasets primarily contain English content and lack coverage of organically and naturally generated content in other languages.
2. **Lack of Content Diversity**: Current datasets mainly focus on traditional news broadcasts and fail to include non-professional first-hand video footage, which is becoming increasingly important in modern news reporting.
3. **Information Bias**: Due to the "translationese" problem in the translation process, existing multilingual datasets may contain unnatural content biases.

To address these issues, the paper constructs a dataset named **MultiVENT**, which includes multilingual, event-centric videos and their corresponding natural text descriptions. MultiVENT comprises news broadcast videos and non-professional event videos, covering 5 target languages (Arabic, Chinese, English, Korean, and Russian), aiming to provide a more comprehensive and multi-perspective sample of event reporting.

Through MultiVENT, the authors hope to explore and analyze the diversity and characteristics of online news videos and provide a baseline model (MultiCLIP) for complex multilingual video retrieval tasks to support the development of information retrieval systems.

MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

Grounding Partially-Defined Events in Multimodal Data

A Dataset with Multi-Modal Information and Multi-Granularity Descriptions for Video Captioning

Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Multi-event Video-Text Retrieval

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection

Connecting Vision and Language with Video Localized Narratives

Towards Long Form Audio-visual Video Understanding

Towards Event-oriented Long Video Understanding

FakeSV: A Multimodal Benchmark with Rich Social Context for Fake News Detection on Short Video Platforms

End-to-end Multi-modal Video Temporal Grounding

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

A Survey of Video Datasets for Grounded Event Understanding

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Video Timeline Modeling For News Story Understanding

VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs