Abstract: In the past few years, language-based video retrieval has attracted a lot of attention. However, its natural extension, localizing the specific moment within a video that matches a description query, has seldom been explored. Although the two tasks look similar, the latter is more challenging for two main reasons: 1) The former task only needs to judge whether the query occurs in a video and returns the entire video, whereas the latter must judge which moment within a video matches the query and accurately return the start and end points of that moment. Because different moments in a video have varying durations and diverse spatial-temporal characteristics, uncovering the underlying moments is highly challenging. 2) As for the key component of relevance estimation, the former usually embeds a video and the query into a common space to compute the relevance score. The latter task, however, concerns moment localization, where not only the features of a specific moment matter but the context of the moment also contributes a lot. For example, the query may contain temporal constraint words, such as "first", which require temporal context to be properly understood. To address these issues, we develop an Attentive Cross-Modal Retrieval Network. In particular, we design a memory attention mechanism to emphasize the visual features mentioned in the query and simultaneously incorporate their context, which yields an augmented moment representation. Meanwhile, a cross-modal fusion sub-network learns both the intra-modality and inter-modality dynamics, which enhances the learning of the moment-query representation. We evaluate our method on two datasets: DiDeMo and TACoS. Extensive experiments show the effectiveness of our model compared with state-of-the-art methods.
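
The two ideas named in the abstract, query-guided attention over a moment and its context, and fusion of the moment and query representations, can be illustrated with a toy sketch. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: plain dot-product attention stands in for the memory attention mechanism, an outer product of the two vectors (each extended with a constant 1) stands in for the fusion sub-network, and the names `attend` and `fuse` and the toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(clips, query):
    """Weight each clip (the moment and its context) by its similarity
    to the query, then pool. Stand-in for memory attention (assumption)."""
    weights = softmax([dot(c, query) for c in clips])
    dim = len(clips[0])
    pooled = [sum(w * c[d] for w, c in zip(weights, clips)) for d in range(dim)]
    return pooled, weights


def fuse(moment, query):
    """Outer product of [moment; 1] and [query; 1]. The appended 1 keeps
    the single-modality terms next to the cross-modal products.
    Stand-in for the cross-modal fusion sub-network (assumption)."""
    m = moment + [1.0]
    q = query + [1.0]
    return [[a * b for b in q] for a in m]


# Toy features: previous context clip, candidate moment, next context clip.
clips = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 2.0]

augmented, weights = attend(clips, query)
fused = fuse(augmented, query)
```

In this toy setting the candidate moment receives the largest attention weight because it is most similar to the query, while the context clips still contribute to the pooled vector. In the fused matrix, the last row holds the query features, the last column holds the moment features, and the remaining entries are pairwise cross-modal products.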

Self-Supervised Graph Convolution for Video Moment Retrieval

Weakly-Supervised Video Moment Retrieval Via Semantic Completion Network

Moment Retrieval via Cross-Modal Interaction Networks with Query Reconstruction

Weakly Supervised Video Moment Retrieval From Text Queries

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

Siamese Alignment Network for Weakly Supervised Video Moment Retrieval

Video Moment Retrieval Via Comprehensive Relation-Aware Network

Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query

Cross-modal Video Moment Retrieval Based on Visual-Textual Relationship Alignment

Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos

Multi-scale 2D Representation Learning for weakly-supervised moment retrieval

Attentive Moment Retrieval in Videos

Weakly-Supervised Video Moment Retrieval via Regularized Two-Branch Proposal Networks with Erasing Mechanism

Video Moment Retrieval from Text Queries via Single Frame Annotation

Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval

Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization

You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos

Video Moment Retrieval with Noisy Labels

Spatial-temporal Graphs for Cross-modal Text2Video Retrieval