Abstract: To retrieve a video via a multimedia search engine, the user usually writes a textual query, which is then used to perform the search. Recent state-of-the-art cross-modal retrieval methods learn a joint text-video embedding space by using contrastive loss functions, which maximize the similarity of positive pairs while decreasing that of negative pairs. Although the choice of these pairs is fundamental to the construction of the joint embedding space, the selection procedure is usually driven by the relationships found within the dataset: a positive pair is commonly formed by a video and its own caption, whereas unrelated video-caption pairs serve as the negatives. We hypothesize that this choice results in a retrieval system with limited semantic understanding, as the standard training procedure requires the system to discriminate between ground-truth and negative pairs even when their semantics do not differ. Therefore, unlike previous approaches, in this paper we propose a novel strategy for the selection of both positive and negative pairs that takes into account both the annotations and the semantic content of the captions. As a result, the selected negatives no longer share semantic concepts with the positive pair, and it also becomes possible to discover new positives within the dataset. Based on our hypothesis, we provide a novel design of two popular contrastive loss functions and explore their effectiveness on three heterogeneous state-of-the-art approaches. An extensive experimental analysis conducted on two datasets, EPIC-Kitchens-100 and MSR-VTT, validates the effectiveness of the proposed strategy, with improvements of, e.g., more than 20% nDCG on EPIC-Kitchens-100. Furthermore, these results are corroborated by qualitative evidence that both supports our hypothesis and explains why the proposed strategy effectively overcomes this limitation.
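The selection strategy described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `relevance` function (a Jaccard overlap of caption words), the threshold `tau`, the margin value, and the toy similarity matrix are all assumptions introduced for illustration. The sketch only shows the general idea that captions semantically close to the anchor caption are excluded from the negatives of a triplet-style loss.

```python
from typing import List, Tuple


def relevance(caption_a: str, caption_b: str) -> float:
    """Hypothetical relevance: Jaccard overlap of the two captions' word sets."""
    words_a = set(caption_a.lower().split())
    words_b = set(caption_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def split_candidates(captions: List[str], anchor: int, tau: float) -> Tuple[List[int], List[int]]:
    """Split caption indices into positives (relevance >= tau) and negatives."""
    positives, negatives = [], []
    for j, caption in enumerate(captions):
        if relevance(captions[anchor], caption) >= tau:
            positives.append(j)
        else:
            negatives.append(j)
    return positives, negatives


def relevance_aware_triplet_loss(sim: List[List[float]], captions: List[str],
                                 tau: float = 0.5, margin: float = 0.2) -> float:
    """Sum of hinge terms over semantically unrelated negatives only.

    sim[i][j] is the similarity between video i and caption j; caption i
    is the annotated caption of video i.
    """
    total = 0.0
    for i in range(len(captions)):
        _, negatives = split_candidates(captions, i, tau)
        for j in negatives:
            total += max(0.0, margin + sim[i][j] - sim[i][i])
    return total


# Toy data (assumed): the first two captions describe the same action.
captions = ["cut the onion", "slice the onion", "open the fridge"]
sim = [
    [0.9, 0.8, 0.1],
    [0.7, 0.9, 0.2],
    [0.1, 0.3, 0.8],
]

# With tau = 0.5, "slice the onion" is not used as a negative for video 0.
loss_aware = relevance_aware_triplet_loss(sim, captions, tau=0.5)
# With tau = 1.0, only the annotated caption counts as positive (standard choice).
loss_standard = relevance_aware_triplet_loss(sim, captions, tau=1.0)
```

In this toy setting the standard selection penalizes video 0 for being similar to "slice the onion", while the relevance-aware selection does not, which mirrors the motivation given in the abstract.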
