Abstract: Video-text retrieval is a fundamental task in multi-modal understanding and has attracted increasing attention from both academia and industry in recent years. Videos inherently contain multi-grained semantics, and each video corresponds to several different texts, which makes the task challenging. Previous best-performing methods adopt video-sentence, phrase-phrase, and frame-word interactions simultaneously. Unlike word/frame features, which can be obtained directly, phrase features must be adaptively aggregated from correlated word/frame features, which is considerably more demanding. However, existing methods use simple intra-modal self-attention to generate phrase features without considering three aspects: cross-modality semantic correlation, phrase generation noise, and phrase diversity. In this paper, we propose a novel Reliable Phrase Mining (RPM) model to construct reliable phrase features and conduct hierarchical cross-modal interactions for video-text retrieval. The proposed RPM model has several merits. First, to guarantee semantic consistency between video phrases and text phrases, we propose a set of modality-shared prototypes that serve as joint queries to aggregate semantically related frame/word features into adaptive-grained phrase features. Second, to deal with phrase generation noise, the proposed denoised decoder module obtains more reliable similarities between prototypes and frame/word features. Specifically, the similarity computation takes into account not only the correlation between frame/word features and prototypes but also the correlation among prototypes. Third, to encourage different prototypes to focus on different semantic information, we design a prototype contrastive loss whose core idea is to make phrases produced by the same prototype more similar than those produced by different prototypes. Extensive experimental results demonstrate that the proposed method performs favorably on three benchmark datasets: MSR-VTT [1], MSVD [2], and LSMDC [3].
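
To make the two central ideas of the abstract concrete, the following is a minimal NumPy sketch of (a) shared prototypes acting as queries that pool frame or word features into phrase features, and (b) a contrastive loss that pulls together phrases produced by the same prototype. The abstract does not give the exact formulation, so everything specific here is an assumption for illustration: the scaled dot-product attention, the InfoNCE-style loss, the temperature value, and the function names `aggregate_phrases` and `prototype_contrastive_loss` are hypothetical and are not taken from the paper. The denoised decoder is not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_phrases(prototypes, tokens):
    """Pool token features into one phrase feature per prototype.

    prototypes: (K, D) array shared by both modalities (assumed learnable).
    tokens:     (N, D) array of frame features or word features.
    Returns a (K, D) array; each row is a convex combination of the tokens.
    Assumption: plain scaled dot-product attention with prototypes as queries.
    """
    d = prototypes.shape[-1]
    attn = softmax(prototypes @ tokens.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ tokens


def prototype_contrastive_loss(video_phrases, text_phrases, temperature=0.1):
    """InfoNCE-style loss over prototypes (an assumed form, not the paper's).

    Row k of each input is the phrase produced by prototype k. The video
    phrase from prototype k should be closer to the text phrase from the
    same prototype than to text phrases from the other prototypes.
    """
    v = video_phrases / (np.linalg.norm(video_phrases, axis=-1, keepdims=True) + 1e-8)
    t = text_phrases / (np.linalg.norm(text_phrases, axis=-1, keepdims=True) + 1e-8)
    logits = v @ t.T / temperature  # (K, K) cosine similarities, scaled
    # Log-softmax over each row, computed stably.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Matching prototypes sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prototypes = rng.normal(size=(4, 8))   # K=4 prototypes, D=8 (toy sizes)
    frames = rng.normal(size=(12, 8))      # 12 frame features
    words = rng.normal(size=(6, 8))        # 6 word features
    video_phrases = aggregate_phrases(prototypes, frames)
    text_phrases = aggregate_phrases(prototypes, words)
    print(video_phrases.shape, text_phrases.shape)
    print(prototype_contrastive_loss(video_phrases, text_phrases))
```

Because the same prototype matrix queries both modalities, row k of the video phrases and row k of the text phrases are tied to the same semantic slot, which is what lets the loss treat the diagonal of the similarity matrix as the positive pairs.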

Hybrid Deep Neural Network for Visual Phrase Detection

Visual relationship detection with a deep convolutional relationship network

Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations

Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

Geometric Neural Phrase Pooling: Modeling the Spatial Co-occurrence of Neurons

Hybrid context enriched deep learning model for fine-grained sentiment analysis in textual and visual semiotic modality social data

Detecting Visual Relationships with Deep Relational Networks

Multiple Instance Learning Using Visual Phrases for Object Classification

Phrase-based Image Captioning with Hierarchical LSTM Model

Multiple Visual Phrase Learning Method for Image Classification

Phrase Grounding Algorithm Based on Transformer Multilevel Feature Fusion

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

RPLNet: Object-Object Affordance Recognition via Relational Phrase Learning

Deep Variation-structured Reinforcement Learning for Visual Relationship and Attribute Detection

Phrase Grounding by Soft-Label Chain Conditional Random Field

Natural Language Guided Visual Relationship Detection

Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval

Discriminative Bag-of-visual Phrase Learning for Landmark Recognition

Phrase-Based Affordance Detection Via Cyclic Bilateral Interaction

Phrase-level Prediction for Video Temporal Localization

Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features