Abstract: Most text-video retrieval methods use the text-image pre-trained CLIP as a backbone and add complex modules that incur high computational overhead. As a result, many studies focus on efficient fine-tuning. The primary challenge in efficient adaptation arises from the inherent differences between the image and video modalities: each sampled video frame must be processed independently by the image encoder, which increases complexity and complicates practical deployment. Although existing efficient methods fine-tune only a small number of trainable parameters, they still incur high inference costs because of the large number of tokens. In this work, we argue that temporal redundancy, that is, the information repeated across consecutive frames, contributes significantly to the model's high complexity. Existing token compression methods for image models fail to address these video-specific challenges because they overlook temporal redundancy across frames. To tackle these problems, we propose Temporal Token Merging (TempMe) to reduce temporal redundancy. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we merge temporal tokens across different frames and learn video-level features, leading to lower complexity and better performance. Extensive experiments validate the superiority of TempMe. Compared to previous efficient text-video retrieval methods, TempMe reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8X speedup and a 4.4% R-Sum improvement. Additionally, TempMe generalizes robustly, integrating effectively with both efficient and full fine-tuning methods. With full fine-tuning, TempMe achieves a 7.9% R-Sum improvement, trains 1.57X faster, and uses 75.2% of the GPU memory. Our code will be released.
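
To make the idea of merging temporal tokens across neighboring clips concrete, the following is a minimal NumPy sketch. It is not the TempMe implementation: the abstract does not specify the merging rule, so this sketch assumes a simple similarity-based scheme in which the r tokens of one clip that are most similar (by cosine similarity) to a token of the neighboring clip are averaged into that token. The function names `merge_clip_pair` and `progressive_merge`, the parameter `r`, and the pairwise merging schedule are all hypothetical choices made for illustration.

```python
import numpy as np

def merge_clip_pair(a, b, r):
    """Merge the r tokens of `a` that are most redundant with `b` into `b`.

    a, b: (N, D) token arrays from two neighboring frames or clips.
    Returns an array with a.shape[0] + b.shape[0] - r tokens.
    (Assumed merging rule for illustration, not the paper's.)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    r = min(r, a.shape[0])

    # Cosine similarity between every token in a and every token in b.
    eps = 1e-12
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    sim = a_n @ b_n.T                    # shape (Na, Nb)

    best_match = sim.argmax(axis=1)      # most similar b-token for each a-token
    best_score = sim.max(axis=1)
    order = np.argsort(-best_score)      # most redundant a-tokens first
    merged_idx, kept_idx = order[:r], order[r:]

    # Average each merged a-token into its matched b-token.
    sums = b.copy()
    counts = np.ones(b.shape[0])
    np.add.at(sums, best_match[merged_idx], a[merged_idx])
    np.add.at(counts, best_match[merged_idx], 1)
    b_merged = sums / counts[:, None]

    return np.concatenate([a[kept_idx], b_merged], axis=0)

def progressive_merge(frames, r):
    """Repeatedly merge neighboring clips pairwise until one clip remains."""
    clips = [np.asarray(f, dtype=float) for f in frames]
    while len(clips) > 1:
        merged = [merge_clip_pair(clips[i], clips[i + 1], r)
                  for i in range(0, len(clips) - 1, 2)]
        if len(clips) % 2 == 1:          # an odd clip is carried over unchanged
            merged.append(clips[-1])
        clips = merged
    return clips[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(4, 8, 16))   # 4 frames, 8 tokens each, dim 16
    video_tokens = progressive_merge(frames, r=4)
    print(video_tokens.shape)
```

With 4 frames of 8 tokens and r = 4, the first round yields two clips of 12 tokens and the second round a single clip of 20 tokens, compared with 32 tokens when frames are simply concatenated. The more tokens consecutive frames share, the larger r can be chosen without discarding distinct content.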

Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval

Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

A Global Approach for Video Matching

Video object matching across multiple non-overlapping camera views based on multi-feature fusion and incremental learning

Video Text Tracking With a Spatio-Temporal Complementary Model

Spatial-temporal Graphs for Cross-modal Text2Video Retrieval

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

UATVR: Uncertainty-Adaptive Text-Video Retrieval

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

ProTA: Probabilistic Token Aggregation for Text-Video Retrieval

Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Text-driven Video Prediction

Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks

TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval