Abstract: Video temporal grounding aims to identify the segments of an untrimmed video that are most relevant to a given natural language query. Existing video temporal grounding models rely on dataset-specific training, which incurs high data collection costs, and they generalize poorly in cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the capabilities of pre-trained large models. A naive baseline is to enumerate proposals in the video and use pre-trained vision-language models (VLMs) to select the best proposal according to vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, so they struggle to (1) grasp the relationships among multiple events within the same video and distinguish their temporal boundaries; and (2) comprehend, and remain sensitive to, the dynamic transitions between events (the transition from one event to another) in the video. To address these issues, we first propose leveraging large language models (LLMs) to identify the multiple sub-events contained in the query text and to analyze the temporal order and relationships between them. Second, we split each sub-event into a dynamic transition part and a static status part, and propose dynamic and static scoring functions based on VLMs to better evaluate the relevance between the event and its description. Finally, for each sub-event description, we use VLMs to locate the top-k proposals and use the order and relationships between sub-events provided by the LLM to filter and integrate these proposals. Without any training, our method achieves the best zero-shot video temporal grounding performance on the Charades-STA and ActivityNet Captions datasets, and it demonstrates better generalization in cross-dataset and OOD settings.
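
The proposal-based pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the per-frame similarities are assumed to have been computed beforehand by some VLM, the function names (`enumerate_proposals`, `static_score`, `top_k_proposals`, `combine_in_order`) are hypothetical, and the inside-minus-outside score is a simplified stand-in for the dynamic and static scoring functions, whose exact form the abstract does not give. Merging two ordered sub-event proposals into one span is likewise an assumption made for illustration.

```python
from itertools import product


def enumerate_proposals(num_frames, min_len=1):
    """Return every (start, end) window over the frames, with end exclusive."""
    return [
        (start, end)
        for start in range(num_frames)
        for end in range(start + min_len, num_frames + 1)
    ]


def static_score(sims, start, end):
    """Assumed score: mean similarity inside the window minus mean outside it."""
    inside = sims[start:end]
    outside = sims[:start] + sims[end:]
    inside_mean = sum(inside) / len(inside)
    outside_mean = sum(outside) / len(outside) if outside else 0.0
    return inside_mean - outside_mean


def top_k_proposals(sims, k=3, min_len=1):
    """Score all proposals for one sub-event and keep the k best."""
    scored = [
        (static_score(sims, start, end), (start, end))
        for start, end in enumerate_proposals(len(sims), min_len)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def combine_in_order(first, second):
    """Pick the best pair in which the first sub-event ends before the second starts.

    Returns (total_score, (start, end)) for the merged span, or None when no
    pair of proposals respects the required order.
    """
    best = None
    for (score_a, (s1, e1)), (score_b, (s2, e2)) in product(first, second):
        if e1 <= s2:  # ordering constraint, assumed to come from the LLM
            total = score_a + score_b
            if best is None or total > best[0]:
                best = (total, (s1, e2))
    return best


# Hypothetical frame-query similarities for two sub-events of one query.
sims_a = [0.1, 0.9, 0.9, 0.1, 0.1, 0.1]
sims_b = [0.1, 0.1, 0.1, 0.1, 0.8, 0.8]

proposals_a = top_k_proposals(sims_a, k=3)
proposals_b = top_k_proposals(sims_b, k=3)
print(combine_in_order(proposals_a, proposals_b))
```

Keeping several candidates per sub-event, rather than only the single best, is what lets the ordering constraint discard a high-scoring proposal that contradicts the temporal relationship between sub-events.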

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Temporal Sentence Grounding in Videos: A Survey and Future Directions

A Survey on Temporal Sentence Grounding in Videos

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

STFormer: Spatial-Temporal-Aware Transformer for Video Instance Segmentation

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

Transformer-based Visual Grounding with Cross-modality Interaction

Language Query-Based Transformer With Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images

Decoupled Spatial Temporal Graphs for Generic Visual Grounding

GloTSFormer: Global Video Text Spotting Transformer