Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Minghang Zheng, Xinhao Cai, Qingchao Chen, Yuxin Peng, Yang Liu
2024-08-29
Abstract: Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on specific datasets for training and have high data collection costs, but they exhibit poor generalization capability under the cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the ability of pre-trained large models. A naive baseline is to enumerate proposals in the video and use pre-trained vision-language models (VLMs) to select the best proposal according to the vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, so they struggle to (1) grasp the relationship and distinguish the temporal boundaries of multiple events within the same video; (2) comprehend and be sensitive to the dynamic transition of events (the transition from one event to another) in the video. To address these issues, we first propose leveraging large language models (LLMs) to analyze multiple sub-events contained in the query text and analyze the temporal order and relationships between these events. Secondly, we split a sub-event into dynamic transition and static status parts and propose the dynamic and static scoring functions using VLMs to better evaluate the relevance between the event and the description. Finally, for each sub-event description, we use VLMs to locate the top-k proposals and leverage the order and relationships between sub-events provided by LLMs to filter and integrate these proposals. Our method achieves the best performance on zero-shot video temporal grounding on Charades-STA and ActivityNet Captions datasets without any training and demonstrates better generalization capabilities in cross-dataset and OOD settings.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the poor generalization of existing video temporal grounding methods in cross-dataset and out-of-distribution (OOD) settings. Existing models are trained on annotated data from specific datasets to learn the alignment between video segments and natural language queries, and their performance drops significantly on unseen data or on a different dataset. Collecting high-quality video temporal grounding datasets is also time-consuming and labor-intensive, which limits large-scale use of these methods in practice.

To address this, the paper proposes Training-Free Video Temporal Grounding (TFVTG), which relies on large-scale pre-trained models, namely large language models (LLMs) and vision-language models (VLMs), instead of training on a task-specific grounding dataset. The paper proposes the following innovations:

1. **Multi-event analysis**: Use LLMs to analyze the sub-events that the query text may contain, producing a text description of each individual sub-event together with the temporal order and relationships between them.
2. **Dynamic and static scoring**: To better understand and localize dynamic transitions in videos, the paper divides each sub-event into a dynamic transition part and a static status part, and designs dynamic and static scoring functions that use VLMs to evaluate how relevant a proposal is to the text query.
3. **Sub-event localization and integration**: Use VLMs to locate the top-k proposals for each sub-event, then filter and integrate these proposals according to the sub-event order and relationships provided by the LLMs to produce the final prediction.

With these components, TFVTG achieves the best zero-shot video temporal grounding performance on the Charades-STA and ActivityNet Captions datasets without any training, and shows better generalization in cross-dataset and OOD settings.
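
The proposal enumeration and order-based integration steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes per-frame query similarities have already been computed by some VLM. It also replaces the paper's dynamic and static scoring functions with a simple stand-in (mean similarity inside a span minus mean similarity outside it), and it handles only a single "first, then second" relationship between two sub-events. All function names are hypothetical.

```python
from itertools import product


def enumerate_proposals(num_frames, min_len=2):
    """Return every (start, end) frame span, end exclusive, of at least min_len frames."""
    return [(s, e)
            for s in range(num_frames)
            for e in range(s + min_len, num_frames + 1)]


def contrast_score(sims, span):
    """Stand-in score (an assumption): mean similarity inside the span minus mean outside."""
    s, e = span
    inside = sims[s:e]
    outside = sims[:s] + sims[e:]
    in_mean = sum(inside) / len(inside)
    out_mean = sum(outside) / len(outside) if outside else 0.0
    return in_mean - out_mean


def top_k_proposals(sims, k=3, min_len=2):
    """Score all spans for one sub-event and keep the k best as (score, span) pairs."""
    scored = [(contrast_score(sims, sp), sp)
              for sp in enumerate_proposals(len(sims), min_len)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def integrate_sequential(first, second):
    """Keep only pairs where the first sub-event ends before the second starts,
    pick the pair with the highest summed score, and return the union span.
    Returns None when no pair satisfies the order constraint."""
    best = None
    for (sc1, (s1, e1)), (sc2, (s2, e2)) in product(first, second):
        if e1 <= s2:
            total = sc1 + sc2
            if best is None or total > best[0]:
                best = (total, (s1, e2))
    return best[1] if best else None


# Toy per-frame similarities for two sub-events (values are made up).
sims_a = [0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]  # e.g. "person opens the door"
sims_b = [0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.1]  # e.g. "person sits down"

props_a = top_k_proposals(sims_a)
props_b = top_k_proposals(sims_b)
prediction = integrate_sequential(props_a, props_b)
print(prediction)
```

In this toy example the order constraint discards any pairing in which the second sub-event would precede the first, so swapping the two proposal lists yields no valid prediction. A fuller version would take the order and relationship type from the LLM output rather than hard-coding a sequential relation.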