Abstract: Video-text retrieval is a fundamental task in multi-modal understanding and has attracted increasing attention from both academia and industry in recent years. Videos inherently contain multi-grained semantics, and each video corresponds to several different texts, which makes the task challenging. Previous best-performing methods adopt video-sentence, phrase-phrase, and frame-word interactions simultaneously. Unlike word/frame features, which can be obtained directly, phrase features must be adaptively aggregated from correlated word/frame features, which makes phrase construction demanding. However, existing methods use simple intra-modal self-attention to generate phrase features without considering three aspects: cross-modality semantic correlation, phrase generation noise, and phrase diversity. In this paper, we propose a novel Reliable Phrase Mining (RPM) model to construct reliable phrase features and conduct hierarchical cross-modal interactions for video-text retrieval. The proposed RPM model has several merits. Firstly, to guarantee semantic consistency between video phrases and text phrases, we propose a set of modality-shared prototypes that serve as joint queries to aggregate semantically related frame/word features into adaptive-grained phrase features. Secondly, to deal with phrase generation noise, the proposed denoised decoder module obtains a more reliable similarity between prototypes and frame/word features. Specifically, the similarity calculation takes into account not only the correlation between frame/word features and prototypes, but also the correlation among prototypes. Furthermore, to encourage different prototypes to focus on different semantic information, we design a prototype contrastive loss whose core idea is to make phrases produced by the same prototype more similar than those produced by different prototypes. 
Extensive experimental results demonstrate that the proposed method performs favorably on three benchmark datasets: MSR-VTT [1], MSVD [2], and LSMDC [3].
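
The two ideas in the abstract that lend themselves to a small illustration are the modality-shared prototypes and the prototype contrastive loss. The following is a minimal sketch, not the RPM implementation: it assumes plain scaled dot-product attention for the aggregation step and an InfoNCE-style form for the loss, neither of which is specified in the abstract, and it leaves out the denoised decoder entirely. All function names, the temperature value, and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def aggregate_phrases(prototypes, features):
    """Use each shared prototype as a query over frame (or word) features.

    Assumption: scaled dot-product attention. Returns one phrase feature
    per prototype, computed as an attention-weighted sum of the features.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    phrases = []
    for proto in prototypes:
        weights = softmax([dot(proto, f) / scale for f in features])
        phrase = [sum(w * f[d] for w, f in zip(weights, features))
                  for d in range(dim)]
        phrases.append(phrase)
    return phrases

def prototype_contrastive_loss(video_phrases, text_phrases, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    The video phrase from prototype k is pulled toward the text phrase
    from the same prototype and pushed away from text phrases produced
    by the other prototypes.
    """
    loss = 0.0
    for k, v in enumerate(video_phrases):
        logits = [cosine(v, t) / temperature for t in text_phrases]
        loss -= math.log(softmax(logits)[k])
    return loss / len(video_phrases)

# Toy data: two shared prototypes, three frame features, two word features.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
frames = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
words = [[3.0, 0.0], [0.0, 3.0]]

# The same prototypes query both modalities, so phrase k is comparable
# across video and text.
video_phrases = aggregate_phrases(prototypes, frames)
text_phrases = aggregate_phrases(prototypes, words)
loss = prototype_contrastive_loss(video_phrases, text_phrases)
print(video_phrases, text_phrases, loss)
```

Because both modalities are queried by the same prototypes, the k-th video phrase and the k-th text phrase are tied to the same semantic slot, which is what allows the loss to treat them as a positive pair.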

Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval