Abstract: Video temporal grounding aims to identify the segments of an untrimmed video that are most relevant to a given natural language query. Existing video temporal grounding models are trained on specific datasets, which incurs high data collection costs, and they generalize poorly in cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the capabilities of pre-trained large models. A naive baseline is to enumerate proposals in the video and use pre-trained vision-language models (VLMs) to select the best proposal according to vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, so they struggle to (1) grasp the relationships among multiple events within the same video and distinguish their temporal boundaries; and (2) comprehend, and remain sensitive to, the dynamic transitions of events (the transition from one event to another) in the video. To address these issues, we first leverage large language models (LLMs) to identify the sub-events contained in the query text and analyze the temporal order and relationships between them. Second, we split each sub-event into a dynamic transition part and a static status part and propose dynamic and static scoring functions using VLMs to better evaluate the relevance between the event and the description. Finally, for each sub-event description, we use VLMs to locate the top-k proposals and leverage the order and relationships between sub-events provided by LLMs to filter and integrate these proposals. Our method achieves the best performance on zero-shot video temporal grounding on the Charades-STA and ActivityNet Captions datasets without any training and demonstrates better generalization capabilities in cross-dataset and OOD settings.
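To make the proposal-based pipeline described in the abstract more concrete, the following is a minimal Python sketch under stated assumptions. It assumes per-frame query-to-frame similarities have already been computed by some VLM and are passed in as plain lists of floats. The names `enumerate_proposals`, `window_score`, `top_k_proposals`, and `ground_ordered_pair` are hypothetical, and `window_score` (mean similarity inside a window minus mean similarity outside it) is only a simple stand-in, not the dynamic and static scoring functions proposed in the paper. The ordering rule shown ("sub-event A ends before sub-event B starts") is likewise just one illustrative relationship of the kind an LLM might return.

```python
from itertools import product


def enumerate_proposals(num_frames, min_len=1):
    """Return every contiguous window as a half-open (start, end) frame range."""
    return [
        (start, end)
        for start in range(num_frames)
        for end in range(start + min_len, num_frames + 1)
    ]


def window_score(sims, start, end):
    """Stand-in score (an assumption, not the paper's scoring function):
    mean similarity inside the window minus mean similarity outside it."""
    inside = sims[start:end]
    outside = sims[:start] + sims[end:]
    inside_mean = sum(inside) / len(inside)
    outside_mean = sum(outside) / len(outside) if outside else 0.0
    return inside_mean - outside_mean


def top_k_proposals(sims, k=3, min_len=1):
    """Rank all enumerated proposals by window_score and keep the best k."""
    proposals = enumerate_proposals(len(sims), min_len)
    ranked = sorted(proposals, key=lambda p: window_score(sims, *p), reverse=True)
    return ranked[:k]


def ground_ordered_pair(sims_a, sims_b, k=3):
    """Choose one proposal per sub-event such that A ends before B starts.

    Returns ((start_a, end_a), (start_b, end_b)), or None when no pair of
    top-k proposals satisfies the ordering constraint.
    """
    best, best_score = None, float("-inf")
    candidates = product(top_k_proposals(sims_a, k), top_k_proposals(sims_b, k))
    for prop_a, prop_b in candidates:
        if prop_a[1] <= prop_b[0]:  # temporal order filter: A before B
            score = window_score(sims_a, *prop_a) + window_score(sims_b, *prop_b)
            if score > best_score:
                best, best_score = (prop_a, prop_b), score
    return best


if __name__ == "__main__":
    # Toy similarities: the single-event query matches frames 2 and 3.
    print(top_k_proposals([0.1, 0.1, 0.9, 0.9, 0.1, 0.1], k=1))
    # Two sub-events: A matches the first two frames, B the last two.
    print(ground_ordered_pair(
        [0.9, 0.9, 0.1, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1, 0.9, 0.9],
    ))
```

Exhaustive enumeration is quadratic in the number of frames, so a sketch like this is only practical on sparsely sampled frames; a real system would likely subsample frames or restrict window lengths.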

GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Visual Grounding for Object-Level Generalization in Reinforcement Learning

Grounding Descriptions in Images informs Zero-Shot Visual Recognition

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

VLGrammar: Grounded Grammar Induction of Vision and Language

Learning Visual Grounding from Generative Vision and Language Model

Language-guided Visual Attention Network for Visual Grounding

LCV2: A Universal Pretraining-Free Framework for Grounded Visual Question Answering

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering

VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

Language Conditioned Multi-Scale Visual Attention Networks for Visual Grounding

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Learning to Ground VLMs without Forgetting

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models