Multi-event Video-Text Retrieval

Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
2023-09-25
Abstract: Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A large body of work uses a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs, and this has become a prominent approach to the VTR task. However, these models assume bijective video-text correspondences and neglect a more practical scenario in which video content usually encompasses multiple events, while texts such as user queries or webpage metadata tend to be specific and correspond to single events. This creates a gap between the previous training objective and real-world applications, which can degrade the performance of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, which addresses scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.
Computer Vision and Pattern Recognition, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of Multi-event Video-Text Retrieval (MeVTR). Traditional Video-Text Retrieval (VTR) models usually assume a one-to-one correspondence between videos and texts. In practice, however, a video often contains multiple distinct events, while a text description usually targets only one of them. This mismatch between the training assumption and real usage can degrade the performance of traditional models in practical applications.

The main contributions of the paper are as follows:

1. **Introduction of a new task**: Formally introduces the MeVTR task and defines new evaluation metrics to accommodate videos that contain multiple events.
2. **Proposing a new model**: Proposes a model named Me-Retriever, which handles multi-event videos through a key event selection module and a new MeVTR loss function.
3. **Experimental validation**: Demonstrates the effectiveness of Me-Retriever through extensive experiments and establishes it as a robust baseline for future research.

Through this work, the authors hope to bridge the gap between traditional VTR frameworks and real-world application scenarios, thereby improving the performance of video-text retrieval.
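
To make the one-to-many setting concrete, the following is a minimal sketch of how recall@k might be scored when one video has several ground-truth texts. It is an illustration only, not the metric definitions from the paper: the function names, the `mode` parameter, and the toy similarity values are all assumptions. Text-to-video retrieval still has one correct video per text, but video-to-text retrieval has to decide whether finding any matching text counts as a hit, or whether the score should reflect the fraction of matching texts retrieved.

```python
from typing import List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores; ties go to the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def text_to_video_recall(sim, text_video, k):
    """sim[t][v] is the similarity of text t and video v.

    Each text has exactly one ground-truth video, text_video[t].
    """
    hits = sum(1 for t, row in enumerate(sim) if text_video[t] in top_k(row, k))
    return hits / len(sim)


def video_to_text_recall(sim, text_video, k, mode="any"):
    """Each video may have several ground-truth texts (one per event).

    mode="any":      a video is a hit if at least one of its texts is in the top k.
    mode="fraction": a video scores the share of its texts found in the top k.
    Both modes are hypothetical scoring choices for this sketch.
    """
    n_videos = len(sim[0])
    total, counted = 0.0, 0
    for v in range(n_videos):
        ground_truth = {t for t, vid in enumerate(text_video) if vid == v}
        if not ground_truth:
            continue  # skip videos without any annotated text
        column = [row[v] for row in sim]
        found = len(ground_truth & set(top_k(column, k)))
        if mode == "any":
            total += 1.0 if found > 0 else 0.0
        else:
            total += found / len(ground_truth)
        counted += 1
    return total / counted


# Toy data (made up): video 0 contains three events, video 1 contains one.
TEXT_VIDEO = [0, 0, 0, 1]
SIM = [
    [0.9, 0.1],  # text 0, an event in video 0
    [0.2, 0.6],  # text 1, an event in video 0 that the model mismatches
    [0.7, 0.3],  # text 2, an event in video 0
    [0.4, 0.8],  # text 3, the event in video 1
]

if __name__ == "__main__":
    print(text_to_video_recall(SIM, TEXT_VIDEO, k=1))
    print(video_to_text_recall(SIM, TEXT_VIDEO, k=2, mode="any"))
    print(video_to_text_recall(SIM, TEXT_VIDEO, k=2, mode="fraction"))
```

On this toy data the two video-to-text modes disagree: the "any" mode rates video 0 as fully retrieved even though one of its three events is missed, while the "fraction" mode penalizes the miss. This is the kind of ambiguity that a bijective evaluation never has to resolve.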