Abstract: Video moment retrieval aims to retrieve a golden moment in a video for a given natural language query. The main challenges of this task are 1) accurately localizing the relevant moment (i.e., its start time and end time) in an untrimmed video stream, and 2) bridging the semantic gap between the textual query and the video contents. To tackle these problems, early approaches use a sliding window or uniform sampling to collect video clips first and then match each clip against the query to identify relevant clips. These strategies are time-consuming and often yield unsatisfactory localization accuracy because the length of the golden moment is unpredictable. To avoid these limitations, researchers have recently attempted to predict the boundaries of the relevant moment directly, without generating video clips first. One mainstream approach is to generate a multimodal feature vector for the target query and the video frames (e.g., by concatenation) and then apply regression to this vector for boundary detection. Although this approach has achieved some progress, we argue that such methods do not capture the cross-modal interactions between the query and the video frames well. In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM) model, which predicts the temporal boundaries by modeling the interaction between the two modalities. In addition, an attention module is introduced to automatically assign higher weights to query words with richer semantic cues, which are considered more important for finding relevant video contents. Another contribution is an additional predictor that utilizes the internal frames during model training to improve localization accuracy. Extensive experiments on two public datasets, TACoS and Charades-STA, demonstrate the superiority of our method over several state-of-the-art methods. Ablation studies have also been conducted to examine the effectiveness of the different modules in our ACRM model.
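
The idea of weighting query words by attention and then scoring each frame against the query can be illustrated with a small, self-contained sketch. This is not the ACRM model: the abstract does not specify its architecture, so everything here is an assumption made for illustration. The toy 2-D vectors, the hand-set word scores (a learned attention module would produce these), cosine similarity as the relevance measure, and the threshold-based span selection in the hypothetical `best_span` helper are all stand-ins.

```python
import math

def softmax(scores):
    """Turn raw word scores into attention weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_query(word_vecs, word_scores):
    """Attention-weighted sum of word vectors (one query vector)."""
    weights = softmax(word_scores)
    dim = len(word_vecs[0])
    return [sum(w * vec[d] for w, vec in zip(weights, word_vecs))
            for d in range(dim)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def frame_relevance(frame_vecs, query_vec):
    """Frame-wise relevance of each video frame to the query."""
    return [cosine(f, query_vec) for f in frame_vecs]

def best_span(relevance, threshold=0.5):
    """Contiguous frame span maximizing sum(relevance - threshold).

    Assumed post-processing (maximum-subarray search), not the paper's
    boundary predictor. Returns inclusive (start, end) frame indices.
    """
    best_sum, best = -math.inf, (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, r in enumerate(relevance):
        gain = r - threshold
        if cur_sum <= 0:
            cur_sum, cur_start = gain, i   # start a new candidate span
        else:
            cur_sum += gain                # extend the current span
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i)
    return best

# Toy data: two query words, the second scored as more informative.
word_vecs = [[1.0, 0.0], [0.0, 1.0]]
word_scores = [0.0, 2.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0], [1.0, 0.1]]

query_vec = attend_query(word_vecs, word_scores)
relevance = frame_relevance(frames, query_vec)
span = best_span(relevance)
print(span)
```

On this toy input the selected span covers frames 1 through 2, the two frames whose features align with the highly weighted query word. Because every frame receives its own relevance score, frames inside the moment contribute to the decision as well as the two boundary frames, which is loosely the intuition behind using internal frames.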
