TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

Leqi Shen, Tianxiang Hao, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding

2024-09-02

Abstract: Most text-video retrieval methods utilize the text-image pre-trained CLIP as a backbone, incorporating complex modules that result in high computational overhead. As a result, many studies focus on efficient fine-tuning. The primary challenge in efficient adaptation arises from the inherent differences between image and video modalities. Each sampled video frame must be processed by the image encoder independently, which increases complexity and complicates practical deployment. Although existing efficient methods fine-tune with a small number of trainable parameters, they still incur high inference costs due to the large number of tokens. In this work, we argue that temporal redundancy, the repeated information in consecutive frames, contributes significantly to the model's high complexity. Existing token compression methods for image models fail to address this challenge because they overlook temporal redundancy across frames. To tackle these problems, we propose Temporal Token Merging (TempMe) to reduce temporal redundancy. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we merge temporal tokens across different frames and learn video-level features, leading to lower complexity and better performance. Extensive experiments validate the superiority of TempMe. Compared to previous efficient text-video retrieval methods, TempMe reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8x speedup and a 4.4% R-Sum improvement. Additionally, TempMe generalizes well, integrating effectively with both efficient and full fine-tuning methods. With full fine-tuning, TempMe achieves a 7.9% R-Sum improvement, trains 1.57x faster, and uses 75.2% of the GPU memory. Our code will be released.
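To make the token-count argument concrete, the following is a rough back-of-the-envelope sketch, not a figure taken from the paper. It assumes a ViT-style image encoder with 224-pixel inputs and 32-pixel patches (as in CLIP ViT-B/32) and a hypothetical 12 sampled frames per video; the function names are illustrative. Only the 95% reduction comes from the abstract, and the baseline it is measured against is not specified here.

```python
def tokens_per_frame(image_size: int = 224, patch_size: int = 32) -> int:
    """Patch tokens plus one class token for a ViT-style image encoder."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def video_tokens(num_frames: int, per_frame: int) -> int:
    """Each frame is encoded independently, so tokens grow linearly with frames."""
    return num_frames * per_frame


per_frame = tokens_per_frame()        # 7 * 7 patches + 1 class token
total = video_tokens(12, per_frame)   # assumed 12 sampled frames
kept = round(total * (1 - 0.95))      # what a 95% reduction would leave

print(per_frame, total, kept)
```

Under these assumptions, a single video carries hundreds of tokens into the retrieval stage even though consecutive frames largely repeat each other, which is the redundancy the abstract targets.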

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper aims to address the issue of efficient fine-tuning in the task of text-video retrieval. Specifically, the paper focuses on the following points:

1. **Problems with existing methods**:
   - Most text-video retrieval methods utilize pre-trained CLIP as the backbone network and introduce complex modules, leading to high computational overhead.
   - The inherent differences between image and video modalities require each sampled video frame to be processed independently through the image encoder, increasing complexity and affecting practical deployment.
   - Although some efficient methods fine-tune with a small number of parameters, the large number of tokens still results in high inference costs.
2. **Temporal redundancy issue**:
   - The paper points out that redundant information between consecutive frames (temporal redundancy) significantly increases model complexity.
   - Existing token compression methods mainly target image models and overlook the issue of temporal redundancy across frames in videos.
3. **Proposed solution**:
   - A new method called Temporal Token Merging (TempMe) is proposed to reduce temporal redundancy.
   - A progressive multi-granularity framework is introduced, which extracts video-level features by gradually merging tokens of adjacent segments, thereby reducing complexity and improving performance.
   - Experimental results show that compared to previous efficient text-video retrieval methods, TempMe can significantly reduce the number of output tokens (95%), lower GFLOPs (51%), achieve a 1.8x speedup, and improve R-Sum by 4.4%.

In summary, this paper is primarily dedicated to improving the efficiency of text-video retrieval tasks by reducing temporal redundancy and proposes an effective method to address this issue.
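The idea of gradually merging tokens of adjacent segments can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a simplified similarity-based merge (match each token to its most similar token in the neighboring clip by cosine similarity, then average the `r` closest pairs), and the names `merge_clips`, `progressive_merge`, and the parameter `r` are hypothetical. It only shows how pairwise merging of neighboring clips shrinks the token set while moving from frame-level to video-level tokens.

```python
import math
from typing import List

Token = List[float]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity between two token vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_clips(clip_a: List[Token], clip_b: List[Token], r: int) -> List[Token]:
    """Merge up to r tokens of clip_a into their best matches in clip_b.

    Assumes both clips are non-empty. Returns the unmerged tokens of clip_a
    followed by the (possibly averaged) tokens of clip_b.
    """
    # For every token in clip_a, find its most similar token in clip_b.
    matches = []
    for i, a in enumerate(clip_a):
        j = max(range(len(clip_b)), key=lambda k: cosine(a, clip_b[k]))
        matches.append((cosine(a, clip_b[j]), i, j))
    matches.sort(reverse=True)  # most similar cross-clip pairs first

    merged_b = [list(t) for t in clip_b]
    counts = [1] * len(clip_b)  # how many tokens each slot has absorbed
    absorbed = set()
    for _, i, j in matches[:r]:
        counts[j] += 1
        # Running mean keeps the slot equal to the average of its members.
        merged_b[j] = [m + (x - m) / counts[j] for m, x in zip(merged_b[j], clip_a[i])]
        absorbed.add(i)

    kept_a = [list(t) for i, t in enumerate(clip_a) if i not in absorbed]
    return kept_a + merged_b


def progressive_merge(frames: List[List[Token]], r: int) -> List[Token]:
    """Repeatedly merge neighboring clips until one video-level clip remains."""
    clips = [list(f) for f in frames]
    while len(clips) > 1:
        merged = [merge_clips(clips[k], clips[k + 1], r)
                  for k in range(0, len(clips) - 1, 2)]
        if len(clips) % 2:  # an odd clip out is carried to the next stage
            merged.append(clips[-1])
        clips = merged
    return clips[0]


# Four near-identical frames with three 2-D tokens each: 12 tokens in, 6 out.
frames = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] for _ in range(4)]
video = progressive_merge(frames, r=2)
print(len(video))
```

Each merge of two clips removes at most `r` tokens, so the reduction compounds across stages: here two frame pairs shrink from 6 to 4 tokens each, and the final stage shrinks 8 tokens to 6. A real system would perform such merging inside the encoder and would likely weight tokens differently; those details are outside what this summary states.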
