Abstract: Temporal sentence grounding in videos is a crucial task in vision-language learning. Its goal is to retrieve, from an untrimmed video, the segment that semantically corresponds to a natural language query. A video usually contains multiple semantic events, which are rarely isolated: they tend to be temporally ordered and semantically correlated (e.g., one event is often the precursor of another). To precisely localize a semantic moment in a video, it is critical to effectively extract and aggregate multi-granularity contextual information, including the fine-grained local context around the moment-related video segment (at the snippet level) and the coarse-grained semantic correlation (at the segment level). A second main insight of this work is that this context aggregation should be guided by the query rather than being fully query-agnostic. Putting these ideas together, we present a new network that performs language-guided multi-granularity context aggregation. It comprises two major modules. The core of the first module is a novel language-guided temporal adaptive convolution (LTAC), devised to extract fine-grained information over video snippets around the ground-truth video segment. It decomposes a convolution into a channel-oriented one and a temporal-oriented one. In particular, because the convolutional channels are assumed to be more sensitive to the query, we learn to generate a dynamic channel-oriented kernel conditioned on the query sentence. As the second module, we propose a language-guided global relation block (LGRB) that extracts video-level context. It augments the contextual feature with a multi-scale temporal attention that handles the scale variation of ground-truth video segments, and a multi-modal semantic attention that relies on the syntax of the query.
For validation, we conducted comprehensive experiments on two widely adopted video benchmarks (i.e., ActivityNet Captions and Charades-STA). The experimental results and ablation studies corroborate the effectiveness of our model designs, which outperform prior state-of-the-art methods on the major performance metrics for the task.
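
The decomposition described for LTAC can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It only shows the general idea of splitting a convolution into a query-dependent channel-oriented step followed by a static temporal-oriented step. The function name `language_guided_temporal_conv`, the projection `w_gen` that generates the channel kernel from a sentence embedding, and all shapes are assumptions made for illustration.

```python
import numpy as np

def language_guided_temporal_conv(video, query, w_gen, temporal_kernel):
    """Illustrative decomposed convolution (hypothetical, not the paper's code).

    video:           (T, C) snippet-level features
    query:           (D,)   sentence embedding
    w_gen:           (D, C*C) assumed projection that generates the channel kernel
    temporal_kernel: (K, C) static per-channel temporal kernel, K odd
    """
    T, C = video.shape
    K = temporal_kernel.shape[0]

    # Channel-oriented step: a C x C kernel generated from the query,
    # applied at every time step (equivalent to a 1x1 convolution).
    channel_kernel = (query @ w_gen).reshape(C, C)
    mixed = video @ channel_kernel  # (T, C)

    # Temporal-oriented step: per-channel (depthwise) filtering along time,
    # zero-padded so the output keeps length T.
    pad = K // 2
    padded = np.pad(mixed, ((pad, pad), (0, 0)))
    out = np.zeros_like(mixed)
    for k in range(K):
        out += padded[k:k + T] * temporal_kernel[k]
    return out

# Usage with random inputs (all sizes are arbitrary).
rng = np.random.default_rng(0)
video = rng.standard_normal((6, 4))
query = rng.standard_normal(3)
w_gen = rng.standard_normal((3, 16))
temporal_kernel = rng.standard_normal((3, 4))
features = language_guided_temporal_conv(video, query, w_gen, temporal_kernel)
```

In this sketch only the channel kernel depends on the query, mirroring the abstract's statement that the channel-oriented kernel is generated dynamically; how the paper makes the temporal-oriented part adaptive is not specified in the abstract and is left static here.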

Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Grounding

EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model

Language-Guided Multi-Granularity Context Aggregation for Temporal Sentence Grounding

Temporal Sentence Grounding in Videos with Fine-Grained Multimodal Correlation

Multi-Scale Contrastive Learning for Video Temporal Grounding

Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment

Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

Weakly-Supervised Spoken Video Grounding Via Semantic Interaction Learning

End-to-end Multi-modal Video Temporal Grounding

Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Weakly Supervised Temporal Adjacent Network for Language Grounding

Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning

Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos

A Survey on Temporal Sentence Grounding in Videos

Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

WINNER: Weakly-Supervised Hierarchical Decomposition and Alignment for Spatio-tEmporal Video Grounding

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering

Training-free Video Temporal Grounding using Large-scale Pre-trained Models