Abstract: Temporal sentence grounding in videos is a crucial task in vision-language learning. Its goal is to retrieve, from an untrimmed video, the segment that semantically corresponds to a natural language query. A video usually contains multiple semantic events, which are rarely isolated: they tend to be temporally ordered and semantically correlated (e.g., one event is often the precursor of another). To precisely localize a semantic moment in a video, it is critical to effectively extract and aggregate multi-granularity contextual information, including the fine-grained local context around the moment-related video segment (at the snippet level) and the coarse-grained semantic correlation (at the segment level). A second main insight of this work is that this context aggregation should be guided by the query rather than being fully query-agnostic. Putting these ideas together, we present a new network that performs language-guided multi-granularity context aggregation. It comprises two major modules. The core of the first module is a novel language-guided temporal adaptive convolution (LTAC), devised to extract fine-grained information from the video snippets around the ground-truth video segment. It decomposes a convolution into a channel-oriented convolution and a temporal-oriented one. Because the convolutional channels are assumed to be more sensitive to the query, we learn to generate a dynamic channel-oriented kernel conditioned on the query sentence. As the second module, we propose a language-guided global relation block (LGRB) that extracts video-level context. It augments the contextual feature with a multi-scale temporal attention, which handles the scale variation of ground-truth video segments, and a multi-modal semantic attention, which relies on the syntax of the query.
To validate the approach, we conducted comprehensive experiments on two widely adopted video benchmarks (ActivityNet Captions and Charades-STA). The experimental results and ablation studies corroborate the effectiveness of our model designs, and the model outperforms prior state-of-the-art methods on the major performance metrics for the task.
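
The decomposition behind LTAC can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the channel-oriented part is a pointwise (per-time-step) channel-mixing kernel generated linearly from the query embedding, and the temporal-oriented part is a static depthwise filter over time with zero padding. The names `query_to_channel_kernel`, `channel_conv`, `temporal_conv`, `ltac_sketch`, and the tensor `proj` are hypothetical and chosen only for illustration.

```python
def query_to_channel_kernel(query, proj):
    """Generate a C x C channel-mixing kernel from a query embedding.

    query: list of D floats (sentence embedding; assumed given).
    proj:  hypothetical learned tensor of shape C x C x D.
    """
    return [[sum(w * q for w, q in zip(weights, query)) for weights in row]
            for row in proj]


def channel_conv(snippets, kernel):
    """Channel-oriented step: mix channels independently at each time step."""
    return [[sum(k * x for k, x in zip(krow, snip)) for krow in kernel]
            for snip in snippets]


def temporal_conv(snippets, taps):
    """Temporal-oriented step: depthwise filter over time, zero-padded.

    taps: odd-length list of filter weights shared by all channels
          (a simplifying assumption for this sketch).
    """
    num_steps, num_channels = len(snippets), len(snippets[0])
    radius = len(taps) // 2
    out = []
    for t in range(num_steps):
        row = [0.0] * num_channels
        for j, w in enumerate(taps):
            s = t + j - radius
            if 0 <= s < num_steps:  # positions outside the video count as zero
                for c in range(num_channels):
                    row[c] += w * snippets[s][c]
        out.append(row)
    return out


def ltac_sketch(snippets, query, proj, taps):
    """Apply the query-conditioned channel kernel, then the temporal filter."""
    kernel = query_to_channel_kernel(query, proj)
    return temporal_conv(channel_conv(snippets, kernel), taps)
```

In this sketch only the channel kernel depends on the query, so two different sentences applied to the same video produce different channel mixtures while sharing the same temporal filter.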

Multi-semantic long-range dependencies capturing for efficient video representation learning

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Abnormal behavior capture of video dynamic target based on 3D convolutional neural network

SCREENING AND CHARACTERIZATION OF KERATINASE FROM Bacillus licheniformis ISOLATED FROM NAMAKKAL POULTRY FARM

DMVC: Multi-Camera Video Compression Network aimed at Improving Deep Learning Accuracy

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Dual Correlation Network for Efficient Video Semantic Segmentation

Selective Dependency Aggregation for Action Classification

Language-Guided Multi-Granularity Context Aggregation for Temporal Sentence Grounding

Deep Dependency Networks for Multi-Label Classification

VideoMamba: State Space Model for Efficient Video Understanding

Deep Dependency Networks and Advanced Inference Schemes for Multi-Label Classification

Video object segmentation via couple streams and feature memory

SEAL: Semantic Attention Learning for Long Video Representation

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Hierarchical Reinforcement Learning Based Video Semantic Coding for Segmentation

From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting

Boosting Video Representation Learning with Multi-Faceted Integration

Deep Common Feature Mining for Efficient Video Semantic Segmentation

Attention-based Dual Context Aggregation for Image Semantic Segmentation

Predictive Coding Based Multiscale Network with Encoder-Decoder LSTM for Video Prediction