Abstract: With the proliferation of sensor-rich mobile devices, videos have become a new means of communication among Internet users. Because video data contains much redundant background information, people often spend considerable time browsing and analyzing video content. This motivates us to investigate the task of temporal sentence grounding in videos. Formally, given an untrimmed video and a natural language sentence query, the task is to identify the start and end points of the video segment that corresponds to the query. With such a technique, people can quickly find specific content of interest in a video by providing a clear and concise text description, which improves their video browsing experience and search efficiency. Previous methods often formulate temporal grounding as a multimodal matching problem. Doing so ignores sentence details that are important for grounding and neglects the guiding role of the sentence in composing and correlating video content over time, which limits temporal grounding accuracy. To solve these problems, we first propose a multimodal co-attention mechanism that mines the semantic details in the query that matter for temporal grounding and finely constructs the semantic correlation between each word in the sentence and the video content. On this basis, we then propose a semantic condition dynamic normalization mechanism to tightly compose the sentence-related video content over time, including a clip-level actionness prediction module for fine-grained temporal boundary adjustment, thus making the temporal grounding results clearer, more flexible, and more accurate. Experiments on public datasets verify the effectiveness of our approach and its superiority over state-of-the-art methods.
Finally, we present our insights on future research directions that deserve further investigation: audio-enabled temporal grounding techniques, weakly supervised grounding problem formulation, and debiased temporal grounding dataset construction.
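
To illustrate the word-to-clip correlation that the abstract describes, the following is a minimal sketch of a generic co-attention step. It assumes dot-product similarity and toy two-dimensional features; the function names (`softmax`, `dot`, `co_attention`) and the example vectors are hypothetical, and this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def co_attention(words, clips):
    """Generic co-attention between word and clip features (illustrative only).

    words: list of word feature vectors
    clips: list of clip feature vectors with the same dimensionality
    Returns (word_attn, clip_attn, attended_words) where
      word_attn[j][i]   = weight of word i when attending from clip j
      clip_attn[i][j]   = weight of clip j when attending from word i
      attended_words[j] = attention-weighted sentence vector for clip j
    """
    # Similarity matrix: sim[i][j] compares word i with clip j (assumed dot product).
    sim = [[dot(w, c) for c in clips] for w in words]
    n_words, n_clips, dim = len(words), len(clips), len(words[0])

    # For each clip, a distribution over the words of the query.
    word_attn = [softmax([sim[i][j] for i in range(n_words)]) for j in range(n_clips)]
    # For each word, a distribution over the clips of the video.
    clip_attn = [softmax(sim[i]) for i in range(n_words)]

    # Query summary per clip: words weighted by their relevance to that clip.
    attended_words = [
        [sum(word_attn[j][i] * words[i][d] for i in range(n_words)) for d in range(dim)]
        for j in range(n_clips)
    ]
    return word_attn, clip_attn, attended_words


# Toy example: two words, three clips (values are made up for illustration).
words = [[1.0, 0.0], [0.0, 1.0]]
clips = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
word_attn, clip_attn, attended_words = co_attention(words, clips)
```

In this toy setup, the first clip attends mostly to the first word and the third clip attends to both words equally, which is the kind of per-word, per-clip correlation the abstract refers to. How the actual model parameterizes and normalizes these weights is not specified here.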

Video Grounding and Its Generalization

Temporal Sentence Grounding in Videos with Fine-Grained Multimodal Correlation

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Dense Events Grounding in Video

Weakly-Supervised Spoken Video Grounding Via Semantic Interaction Learning

Video sentence grounding with temporally global textual knowledge

DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization

End-to-End Dense Video Grounding via Parallel Regression

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Zero-Shot Video Grounding With Pseudo Query Lookup and Verification

A Survey on Temporal Sentence Grounding in Videos

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

End-to-end Multi-modal Video Temporal Grounding

Temporal Sentence Grounding in Videos: A Survey and Future Directions

Language-free Training for Zero-shot Video Grounding

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding

Video-Guided Curriculum Learning for Spoken Video Grounding

SnAG: Scalable and Accurate Video Grounding

Equivariant and Invariant Grounding for Video Question Answering