Abstract: Video moment retrieval aims to retrieve a golden moment in a video for a given natural language query. The main challenges of this task include 1) accurately localizing the relevant moment (i.e., its start time and end time) in an untrimmed video stream, and 2) bridging the semantic gap between the textual query and the video contents. To tackle these problems, early approaches adopt a sliding window or uniform sampling to collect video clips first and then match each clip with the query to identify relevant clips. These strategies are time-consuming and often lead to unsatisfactory localization accuracy because the length of the golden moment is unpredictable. To avoid these limitations, researchers have recently attempted to predict the relevant moment boundaries directly, without generating video clips first. One mainstream approach is to generate a multimodal feature vector for the target query and video frames (e.g., by concatenation) and then apply a regression approach to that vector for boundary detection. Although this approach has achieved some progress, we argue that those methods have not captured the cross-modal interactions between the query and video frames well. In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM) model, which predicts the temporal boundaries based on interaction modeling between the two modalities. In addition, an attention module is introduced to automatically assign higher weights to query words with richer semantic cues, which are considered more important for finding relevant video contents. Another contribution is an additional predictor that utilizes the internal frames during model training to improve localization accuracy. Extensive experiments on two public datasets, TACoS and Charades-STA, demonstrate the superiority of our method over several state-of-the-art methods. Ablation studies have also been conducted to examine the effectiveness of the different modules in our ACRM model.
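
The two ideas named in the abstract, attention over query words and frame-wise matching against the query, can be illustrated with a toy sketch. This is not the ACRM model itself: the vectors, the attention scores, the dot-product relevance function, and the thresholded span selection are all assumptions chosen for illustration, and the helper names (`attend_query`, `frame_relevance`, `best_span`) are hypothetical. In the actual model these quantities would be learned, and the boundaries are predicted rather than thresholded.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_query(word_vecs, word_scores):
    # Pool word vectors into one query vector; words with higher
    # (assumed, hand-set) scores receive larger attention weights.
    weights = softmax(word_scores)
    dim = len(word_vecs[0])
    return [sum(w * v[d] for w, v in zip(weights, word_vecs)) for d in range(dim)]

def frame_relevance(frame_vecs, query_vec):
    # Score every frame against the query: sigmoid of a dot product.
    return [1.0 / (1.0 + math.exp(-dot(f, query_vec))) for f in frame_vecs]

def best_span(scores, threshold=0.5):
    # Return (start, end) frame indices of the contiguous run of
    # above-threshold frames with the largest total score, or None.
    best, best_sum = None, 0.0
    start, run_sum = None, 0.0
    for i, s in enumerate(scores):
        if s > threshold:
            if start is None:
                start, run_sum = i, 0.0
            run_sum += s
        elif start is not None:
            if run_sum > best_sum:
                best, best_sum = (start, i - 1), run_sum
            start = None
    if start is not None and run_sum > best_sum:
        best = (start, len(scores) - 1)
    return best

# Toy data: two query words, five video frames, 2-d features.
word_vecs = [[1.0, 0.0], [0.0, 1.0]]
word_scores = [2.0, 0.0]  # the first word is treated as more informative
frame_vecs = [[-2.0, 0.0], [3.0, 0.0], [3.0, 1.0], [-1.0, -1.0], [2.0, 0.0]]

query_vec = attend_query(word_vecs, word_scores)
scores = frame_relevance(frame_vecs, query_vec)
span = best_span(scores)
print(span)
```

With this toy input, frames 1 and 2 form the highest-scoring contiguous run, so the sketch selects them as the moment. Scoring every frame directly is what removes the need to enumerate candidate clips first.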

Gazing After Glancing: Edge Information Guided Perception Network for Video Moment Retrieval

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

PerimetryNet: A Multiscale Fine Grained Deep Network for Three-Dimensional Eye Gaze Estimation Using Visual Field Analysis

Video Moment Retrieval from Text Queries via Single Frame Annotation

Attentive Moment Retrieval in Videos

You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos

QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval

Video Moment Retrieval with Noisy Labels

Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

Rethinking Video Sentence Grounding from a Tracking Perspective with Memory Network and Masked Attention

Prompt-based Zero-shot Video Moment Retrieval

Semantic Video Moment Retrieval by Temporal Feature Perturbation and Refinement

VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Context-aware network with foreground recalibration for grounding natural language in video

2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval

Transferable Video Moment Localization by Moment-Guided Query Prompting

Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval

Frame-Wise Cross-Modal Matching for Video Moment Retrieval

Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems

Motion Guided Region Message Passing for Video Captioning