Abstract: Videos have become a new way of communication among Internet users with the proliferation of sensor-rich mobile devices. Because video data contains much redundant background information, people often spend considerable time browsing and analyzing video content. This need motivates us to investigate the temporal sentence grounding task in videos. Formally, given an untrimmed video and a natural language sentence query, the task is to identify the start and end points of the video segment that corresponds to the query. With such a technique, people can quickly find specific content of interest in a video by providing a clear and concise text description, which improves their video browsing experience and search efficiency. Previous methods often formulate temporal grounding as a multimodal matching problem. Doing so ignores sentence details that are important for grounding and neglects the guiding role the sentence plays in composing and correlating video contents over time, which limits temporal grounding accuracy. To solve these problems, we first propose a multimodal co-attention mechanism that mines the semantic details in the given query that matter for temporal grounding and builds a fine-grained semantic correlation between each word in the sentence and the video content. On this basis, we then propose a semantic condition dynamic normalization mechanism that tightly composes the sentence-related video content over time and includes a clip-level actionness prediction module for fine-grained temporal boundary adjustment, making the temporal grounding results clearer, more flexible, and more accurate. Experiments on public datasets verify the effectiveness of our approach and its superiority over state-of-the-art methods.
Finally, we present our insights on future research directions that deserve further investigation: audio-enabled temporal grounding techniques, weakly supervised formulations of the grounding problem, and the construction of debiased temporal grounding datasets.
