Abstract: Compressed videos offer a compelling alternative to raw videos, with the potential to significantly reduce online computational and storage costs. However, current approaches to compressed video processing generally follow the resource-intensive pre-training and fine-tuning paradigm, which does not fully exploit these advantages and limits their suitability for widespread application. Inspired by the recent success of prompt tuning in computer vision, this paper presents the first attempt to build a prompt-based representation learning framework that enables effective and efficient adaptation of pre-trained raw video models to compressed video understanding tasks. To this end, we propose a novel prompt tuning approach, Compressed Video Prompt Tuning (CVPT), which directly addresses the inconsistency between the pre-training and downstream data modalities. Specifically, CVPT replaces the learnable prompts with compressed modalities (e.g., Motion Vectors and Residuals) by re-parameterizing them into conditional prompts followed by layer-wise refinement. The conditional prompts adapt and generalize to individual instances better than conventional, independently learned prompts, and the Residual prompts enhance the noisy motion cues in the Motion Vector prompts before they are fused with the visual cues from I-frames. Additionally, we design Selective Cross-modal Complementary Prompt (SCCP) blocks. Inserted into the backbone, SCCP blocks leverage semantic relations across diverse levels and modalities to improve cross-modal interactions between prompts and input flows. Extensive evaluations on HMDB-51, UCF-101 and Something-Something v2 demonstrate that CVPT remarkably outperforms state-of-the-art counterparts, delivering a much better balance between accuracy and efficiency.
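To make the idea of conditional prompts more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each compressed modality is split into patches and linearly projected into prompt tokens, so the prompts depend on the input instance rather than being free parameters. The names `patchify` and `ConditionalPrompt`, the tensor sizes, and the additive refinement of Motion Vector prompts by Residual prompts are all hypothetical choices made for illustration; the abstract does not specify them, and the SCCP blocks are not modeled here.

```python
import numpy as np


def patchify(x, patch):
    """Split a (C, H, W) map into flattened non-overlapping patches: (N, C*patch*patch)."""
    c, h, w = x.shape
    x = x.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)


class ConditionalPrompt:
    """Hypothetical re-parameterization: project input patches into prompt tokens."""

    def __init__(self, in_dim, embed_dim, rng):
        # In a real model these would be the (few) trainable parameters.
        self.weight = rng.normal(0.0, 0.02, size=(in_dim, embed_dim))
        self.bias = np.zeros(embed_dim)

    def __call__(self, patches):
        return patches @ self.weight + self.bias


rng = np.random.default_rng(0)
patch, embed_dim = 16, 64

# Random stand-ins for one frame's compressed-domain inputs (assumed sizes).
motion_vectors = rng.normal(size=(2, 64, 64))  # (dx, dy) displacement maps
residuals = rng.normal(size=(3, 64, 64))       # RGB residual map
iframe_tokens = rng.normal(size=(16, embed_dim))  # stand-in for frozen-backbone I-frame tokens

mv_prompt = ConditionalPrompt(2 * patch * patch, embed_dim, rng)(patchify(motion_vectors, patch))
res_prompt = ConditionalPrompt(3 * patch * patch, embed_dim, rng)(patchify(residuals, patch))

# Assumption: Residual prompts refine Motion Vector prompts by simple addition.
refined_mv_prompt = mv_prompt + res_prompt

# Prompt tokens are prepended to the I-frame tokens before entering the backbone.
tokens = np.concatenate([refined_mv_prompt, iframe_tokens], axis=0)
print(tokens.shape)
```

In this sketch only the two projection layers would be trained, while the backbone that consumes `tokens` stays frozen, which is the general motivation for prompt tuning over full fine-tuning.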
