Abstract: This paper focuses on temporal retrieval of activities in videos via sentence queries. Given a sentence query describing an activity, temporal moment retrieval aims to localize the temporal segment within the video that best matches the textual query. This is a general yet challenging task, as it requires comprehension of both video and language. Existing research predominantly employs coarse frame-level features as the visual representation, obscuring specific details within the video (e.g., the desired objects "girl" and "cup" and the action "pour") that may provide critical cues for localizing the desired moment. In this paper, we propose a novel Spatial and Language-Temporal Tensor Fusion (SLTF) approach to address this issue. Specifically, the SLTF method first takes advantage of object-level local features and attends to the most relevant ones (e.g., the local features "girl" and "cup") by spatial attention. We then encode the sequence of local features on consecutive frames with an LSTM network, which can capture the motion information and the interactions among these objects (e.g., the interaction "pour" involving these two objects). Meanwhile, language-temporal attention is utilized to emphasize the keywords based on moment context information. Thereafter, a tensor fusion network learns both the intra-modality and inter-modality dynamics, which can enhance the learning of the moment-query representation. As a result, the two proposed attention sub-networks can adaptively recognize the most relevant objects and interactions in the video, and simultaneously highlight the keywords in the query for retrieving the desired moment. Experimental results on three public benchmark datasets (TACoS, Charades-STA, and DiDeMo) show that the SLTF model significantly outperforms current state-of-the-art approaches, and demonstrate the benefits of the new techniques incorporated into SLTF.
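
To make two of the steps above concrete, the following is a minimal sketch of query-conditioned spatial attention over object features followed by a tensor fusion step. It is not the SLTF implementation. The abstract does not give the scoring function or the fusion formula, so the dot-product scoring and the outer product of vectors extended with a constant 1 (a common tensor fusion formulation) are assumptions, as are the names `spatial_attention` and `tensor_fusion` and the toy feature values. The LSTM encoding over frames and the language-temporal attention are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spatial_attention(object_feats, query_vec):
    """Weight each object feature by its relevance to the query, then sum.

    Assumption: relevance is a plain dot product; the paper may use a
    learned scoring function instead.
    """
    weights = softmax([dot(f, query_vec) for f in object_feats])
    dim = len(object_feats[0])
    attended = [
        sum(w * f[i] for w, f in zip(weights, object_feats)) for i in range(dim)
    ]
    return attended, weights


def tensor_fusion(visual_vec, query_vec):
    """Outer product of [1; v] and [1; q] (assumed formulation).

    The constant 1 keeps the unimodal terms (first row and first column)
    next to the bimodal products v_i * q_j, so one tensor carries both
    intra-modality and inter-modality information.
    """
    v = [1.0] + list(visual_vec)
    q = [1.0] + list(query_vec)
    return [[vi * qj for qj in q] for vi in v]


# Toy example: three hypothetical object features in one frame, one query vector.
objects = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]

attended, weights = spatial_attention(objects, query)
fused = tensor_fusion(attended, query)
```

In this toy run the first object receives the largest attention weight because it aligns best with the query. The first row of `fused` reproduces the query vector and the first column reproduces the attended visual vector, while the remaining entries hold the cross-modal products.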

SLTFNet: A Spatial and Language-Temporal Tensor Fusion Network for Video Moment Retrieval