Abstract: In recent years, text-to-video retrieval methods based on CLIP have developed rapidly. The primary direction of evolution is to exploit a much wider gamut of visual and textual cues to achieve alignment. Concretely, the methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, so that during the training phase the model captures visual content features spanning from abstract to detailed levels. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This solution balances the coarse and fine granularity of retrieval content, and it also balances retrieval effectiveness against efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of this approach. Notably, our method achieves performance comparable to the current state-of-the-art methods while being nearly 50 times faster.
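As a rough illustration of what a parameter-free, text-gated aggregation of frame features could look like, the sketch below weights each frame by a softmax over its similarity to the text embedding and sums the weighted frames. This is an assumption made for illustration, not the paper's confirmed TIB definition: the function name `text_gated_pool`, the `temperature` value, and the use of cosine similarity are all hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def text_gated_pool(text_vec, frame_vecs, temperature=0.1):
    """Hypothetical text-gated pooling with no learned parameters.

    Each frame gets a weight softmax(cos(text, frame) / temperature);
    the video representation is the weighted sum of normalized frames.
    """
    t = l2_normalize(text_vec)
    frames = [l2_normalize(f) for f in frame_vecs]
    logits = [dot(t, f) / temperature for f in frames]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * f[i] for w, f in zip(weights, frames))
        for i in range(len(t))
    ]
    return pooled, weights


# Toy example: the first frame matches the text, the second does not.
pooled, weights = text_gated_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With a small temperature the pooled vector is dominated by the frames most similar to the text; with a large one it approaches plain mean pooling.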

What problem does this paper attempt to address?

This paper addresses how to improve retrieval quality while maintaining retrieval efficiency in the text-to-video retrieval task. Existing methods strengthen the interaction between vision (videos, frames) and text (sentences, words) with complex fusion modules and achieve good alignment as a result, but they suffer from high computational complexity, insufficient feature utilization, and low retrieval efficiency. In addition, overly fine-grained feature computation may amplify noise in local regions of the video and degrade retrieval quality. To overcome these problems, the paper proposes EERCF (Efficient and Effective text-to-video Retrieval with Coarse-to-Fine visual representation learning), which aims for efficient and effective text-to-video retrieval through coarse-to-fine visual representation learning.

The main contributions of EERCF include:

1. **Introducing the Text-Gated Interaction Block (TIB) without additional learnable parameters**: It is used for multi-granularity adaptive representation learning and is combined with a cross-feature contrastive loss and an intra-feature Pearson constraint to optimize feature learning.
2. **Proposing a two-stage text-to-video retrieval strategy**: This strategy significantly improves retrieval efficiency while preserving retrieval quality, which makes it convenient for practical applications.
3. **Verifying the effectiveness of the method on multiple benchmark datasets**: The performance of EERCF on MSRVTT-1K-Test, MSRVTT-3K-Test, VATEX, and ActivityNet is close to or exceeds the current state-of-the-art methods, while the computational complexity is reduced by about 14 times, 39 times, 20 times, and 126 times, respectively.

Through these innovations, EERCF improves both the quality and the efficiency of text-to-video retrieval, providing a better solution for practical applications.
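The two-stage strategy described above (fast recall of top-k candidates with coarse video features, then reranking with fine-grained features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `two_stage_retrieve`, the `fine_score` callback, and the toy max-over-frames scorer are hypothetical names and choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def two_stage_retrieve(text_vec, coarse_vecs, fine_score, k=2):
    """Return video indices, best first.

    Stage 1 scores every video with one cheap coarse vector and keeps the
    top-k. Stage 2 applies the costlier `fine_score(text_vec, video_index)`
    (a hypothetical callback) to those k candidates only.
    """
    by_coarse = sorted(
        range(len(coarse_vecs)),
        key=lambda i: cosine(text_vec, coarse_vecs[i]),
        reverse=True,
    )
    candidates = by_coarse[:k]
    return sorted(candidates, key=lambda i: fine_score(text_vec, i), reverse=True)


# Toy data: one coarse vector per video, plus per-frame vectors for reranking.
text = [1.0, 0.0]
coarse = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
frames = [
    [[0.5, 0.5], [0.6, 0.4]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0]],
]


def best_frame_score(t, i):
    """Assumed fine scorer: similarity of the best-matching frame."""
    return max(cosine(t, f) for f in frames[i])


ranking = two_stage_retrieve(text, coarse, best_frame_score, k=2)
```

In this toy case video 2 is dropped at the coarse stage, and the fine stage moves video 1 ahead of video 0 because one of its frames matches the text exactly. The efficiency gain comes from running the expensive scorer on k videos instead of the whole collection.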

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval

Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

Coarse-to-fine dual-level attention for video-text cross modal retrieval

Fine-grained Text-Video Retrieval with Frozen Image Encoders

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations

Text-Video Retrieval with Global-Local Semantic Consistent Learning

Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval.

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval.

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

UATVR: Uncertainty-Adaptive Text-Video Retrieval