Abstract: With the proliferation of sensor-rich mobile devices, videos have become a new way of communication among Internet users. Because video data contains much redundant background information, people usually spend a lot of time browsing and analyzing video content. This motivates us to investigate the temporal sentence grounding task in videos. Formally, given an untrimmed video and a natural language sentence query, the task is to identify the start and end points of the video segment that corresponds to the query. With such a technique, people can quickly find specific content of interest in a video by providing a clear and concise text description, which improves both the video browsing experience and search efficiency. Previous methods often formulate temporal grounding as a multimodal matching problem. Doing so ignores sentence details that are important for grounding and neglects the guiding role of the sentence in composing and correlating video contents over time, which limits temporal grounding accuracy. To solve these problems, we first propose a multimodal co-attention mechanism that mines the semantic details in the query that matter for temporal grounding and constructs a fine-grained semantic correlation between each word in the sentence and the video content. On this basis, we then propose a semantic conditioned dynamic normalization mechanism that tightly composes the sentence-related video content over time, together with a clip-level actionness prediction module for fine-grained temporal boundary adjustment, making the temporal grounding results clearer, more flexible, and more accurate. Experiments on public datasets verify the effectiveness of our approach and its superiority over state-of-the-art methods. Last but not least, we present our insights on future research directions that deserve further investigation: audio-enabled temporal grounding techniques, weakly supervised formulations of the grounding problem, and the construction of debiased temporal grounding datasets.
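To make the idea of correlating each word with the video content more concrete, the following is a minimal sketch of a generic word-clip co-attention step. It is not the model described in the abstract: the function name `co_attention`, the scaled dot-product affinity, and the plain-list feature representation are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(clips, words):
    """Hypothetical co-attention between clip and word features.

    clips: list of T clip feature vectors, each of dimension d
    words: list of N word feature vectors, each of dimension d
    Returns (clip_to_word, word_to_clip, attended_words).
    """
    d = len(clips[0])
    # affinity[t][n]: scaled dot product between clip t and word n (an assumption)
    affinity = [
        [sum(c * w for c, w in zip(clip, word)) / math.sqrt(d) for word in words]
        for clip in clips
    ]
    # For each clip, a distribution over the words of the query
    clip_to_word = [softmax(row) for row in affinity]
    # For each word, a distribution over the clips of the video
    word_to_clip = [
        softmax([affinity[t][n] for t in range(len(clips))])
        for n in range(len(words))
    ]
    # Sentence context per clip: attention-weighted sum of word features
    attended_words = [
        [sum(a * word[k] for a, word in zip(weights, words)) for k in range(d)]
        for weights in clip_to_word
    ]
    return clip_to_word, word_to_clip, attended_words


# Toy features: three clips and two words in a 2-dimensional space
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
clip_to_word, word_to_clip, attended_words = co_attention(clips, words)
```

In this toy setup, each clip attends most strongly to the word whose feature vector it resembles, and the third clip, which resembles both words equally, splits its attention evenly between them.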
