Abstract: With the proliferation of sensor-rich mobile devices, videos have become a new way of communication among Internet users. Because video data contains much redundant background information, people usually spend considerable time browsing and analyzing video content. This need motivates us to investigate the temporal sentence grounding task in videos. Formally, given an untrimmed video and a natural language sentence query, the task is to identify the start and end points of the video segment that corresponds to the query. With such a technique, people can quickly find specific content of interest in a video by providing a clear and concise text description, thereby improving users’ video browsing experience and search efficiency. Previous methods often formulate temporal grounding as a multimodal matching problem. Doing so ignores sentence details that are important for grounding and neglects the guiding role of the sentence in composing and correlating video contents over time, which limits temporal grounding accuracy. To solve these problems, we first propose a multimodal co-attention mechanism that mines the semantic details in the query that are important for temporal grounding and finely constructs the semantic correlation between each word in the sentence and the video content. On this basis, we then propose a semantic condition dynamic normalization mechanism to tightly compose the sentence-related video content over time, including a clip-level actionness prediction module for fine-grained temporal boundary adjustment, thus making the temporal grounding results clearer, more flexible, and more accurate. Experiments on public datasets verify the effectiveness of our approach and its superiority over state-of-the-art methods. Last but not least, we present our insights on future research directions that deserve further investigation: audio-enabled temporal grounding techniques, weakly supervised grounding problem formulation, and debiased temporal grounding dataset construction.
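To make the task definition concrete, the following is a minimal sketch of how a predicted segment might be read off clip-level actionness scores and compared against a ground-truth segment with temporal IoU. It is an illustration only and not the method of the paper: the function names, the fixed clip length, the 0.5 threshold, and the rule of picking the contiguous run with the highest total score are all assumptions made for this example.

```python
from typing import List, Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection over union of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_from_actionness(
    scores: List[float], clip_len: float, threshold: float = 0.5
) -> Tuple[float, float]:
    """Pick the contiguous run of clips scoring >= threshold with the largest
    total score and return its (start, end) in seconds.

    Hypothetical decoding rule for illustration; returns (0.0, 0.0) if no
    clip reaches the threshold.
    """
    best = (0.0, 0.0)
    best_sum = 0.0
    run_start = None
    run_sum = 0.0
    # Iterate one step past the end so that a run reaching the last clip is closed.
    for i in range(len(scores) + 1):
        active = i < len(scores) and scores[i] >= threshold
        if active:
            if run_start is None:
                run_start, run_sum = i, 0.0
            run_sum += scores[i]
        elif run_start is not None:
            if run_sum > best_sum:
                best_sum = run_sum
                best = (run_start * clip_len, i * clip_len)
            run_start = None
    return best


# Assumed example: eight 2-second clips with made-up actionness scores.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.6, 0.1]
pred = segment_from_actionness(scores, clip_len=2.0)
iou = temporal_iou(pred, (5.0, 10.0))
```

Here the predicted segment covers clips 2 through 4, and the IoU against the assumed ground-truth segment (5.0, 10.0) measures how closely the predicted start and end points match it.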

Video2Subtitle: Matching Weakly-Synchronized Sequences Via Dynamic Temporal Alignment

A Global Approach for Video Matching

Video Content Aware Dynamic Subtitles

Sequence Multi-Labeling: A Unified Video Annotation Scheme with Spatial and Temporal Context

GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

Multiscale Video Sequence Matching for Near-Duplicate Detection and Retrieval

Temporal Tessellation: A Unified Approach for Video Analysis

Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Joint Generation of Captions and Subtitles with Dual Decoding

Temporal Sentence Grounding in Videos with Fine-Grained Multimodal Correlation

Speaker-following Video Subtitles

Video-to-Video Translation with Global Temporal Consistency

Video Text Tracking With a Spatio-Temporal Complementary Model

Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching

Character-aware audio-visual subtitling in context

Reading the Videos: Temporal Labeling for Crowdsourced Time-Sync Videos Based on Semantic Embedding

Siamese Alignment Network for Weakly Supervised Video Moment Retrieval

Aligning Subtitles in Sign Language Videos

Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval