Abstract: To retrieve a video via a multimedia search engine, the user typically formulates a textual query, which is then used to perform the search. Recent state-of-the-art cross-modal retrieval methods learn a joint text-video embedding space by using contrastive loss functions, which maximize the similarity of positive pairs while decreasing that of negative pairs. Although the choice of these pairs is fundamental to the construction of the joint embedding space, the selection procedure is usually driven by the relationships found within the dataset: a positive pair is commonly formed by a video and its own caption, whereas unrelated video-caption pairs serve as negatives. We hypothesize that this choice results in a retrieval system with limited semantic understanding, as the standard training procedure requires the system to discriminate between the ground truth and a negative even when there is no difference in their semantics. Therefore, unlike previous approaches, in this paper we propose a novel strategy for the selection of both positive and negative pairs that takes into account both the annotations and the semantic content of the captions. As a result, the selected negatives no longer share semantic concepts with the positive pair, and it also becomes possible to discover new positives within the dataset. Based on our hypothesis, we provide a novel design of two popular contrastive loss functions and explore their effectiveness on three heterogeneous state-of-the-art approaches. An extensive experimental analysis conducted on two datasets, EPIC-Kitchens-100 and MSR-VTT, validates the effectiveness of the proposed strategy, with improvements of, e.g., more than +20% nDCG on EPIC-Kitchens-100. Furthermore, these results are corroborated by qualitative evidence that both supports our hypothesis and explains why the proposed strategy effectively overcomes the limitation it identifies.
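
The selection idea described in the abstract can be illustrated with a minimal sketch of a triplet-style loss in which positives and negatives are chosen by caption relevance rather than by dataset pairing alone. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the use of word-set overlap (Jaccard) as a stand-in for semantic relevance, the threshold `tau`, and the margin value are all hypothetical.

```python
import math


def caption_relevance(a, b):
    """Illustrative proxy for semantic relevance: Jaccard overlap of word sets.

    Assumption: the actual relevance function may differ; this is a stand-in.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def relevance_aware_triplet_loss(video_emb, text_emb, captions, margin=0.2, tau=0.5):
    """Mean hinge loss over (video, positive caption, negative caption) triplets.

    For video i, caption j counts as a positive when its relevance to the
    video's own caption is at least `tau` (hypothetical threshold), and as a
    negative otherwise. Captions that share semantics with the ground truth
    are therefore never used as negatives.
    """
    n = len(captions)
    total, count = 0.0, 0
    for i in range(n):
        rel = [caption_relevance(captions[i], captions[j]) for j in range(n)]
        sims = [cosine(video_emb[i], text_emb[j]) for j in range(n)]
        positives = [j for j in range(n) if rel[j] >= tau]
        negatives = [j for j in range(n) if rel[j] < tau]
        for p in positives:
            for q in negatives:
                total += max(0.0, margin + sims[q] - sims[p])
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Two clips share the same caption; standard pairing would treat them as negatives.
    captions = ["cut onion", "cut onion", "wash pan"]
    emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(relevance_aware_triplet_loss(emb, emb, captions))
```

In this toy setup, the two clips captioned "cut onion" are treated as mutual positives, so identical embeddings for them incur no penalty. If every caption were treated as relevant only to its own video, the same embeddings would be penalized for placing the two semantically identical clips close together.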

Robust Semantic Video Indexing by Harvesting Web Images

Exploiting Web Images for Semantic Video Indexing Via Robust Sample-Specific Loss

Improving semantic video retrieval models by training with a relevance-aware online mining strategy

Robust Semantic Concept Detection in Large Video Collections

Video diver: generic video indexing with diverse features

Video News Indexing Using Semantic-Face

Semantic Video Search by Exploiting Large-Scale Visual Concepts

Fast And Accurate Content-Based Semantic Search In 100m Internet Videos

Semantic-based surveillance video retrieval

An integrated semantic-based approach in concept based video retrieval

Automatic Moving Object Extraction toward Content-Based Video Representation and Indexing

Applying Semantic Association To Support Content-Based Video Retrieval

Semantic Concept Learning Through Massive Internet Video Mining

Enhancing Video Event Recognition Using Automatically Constructed Semantic-Visual Knowledge Base

Video Data Mining: Semantic Indexing and Event Detection from the Association Perspective

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Recent Advances And Challenges Of Semantic Image/Video Search

Exploiting Semantic And Visual Context For Effective Video Annotation

Semantic Video Classification And Feature Subset Selection Under Context And Concept Uncertainty

Video indexing and retrieval in compressed domain using fuzzy-categorization

Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images