Abstract: Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks. Because image models have a huge number of parameters and transfer effectively, full fine-tuning is inefficient and even unnecessary. Recent research has therefore shifted its focus toward parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational cost to deal with the domain gap and temporal modeling in videos. In this paper, we present a new adaptation paradigm (ZeroI2V) that transfers image transformers to video recognition tasks while introducing zero extra cost to the original models during inference. To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA). This approach efficiently endows image transformers with temporal modeling capability with zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that uses lightweight, densely placed linear adapters to fully transfer the frozen image models to video recognition. Thanks to the customized linear design, all newly added adapters can be easily merged with the original modules through structural reparameterization after training, enabling zero extra cost during inference. Extensive experiments on representative fully-supervised and few-shot video recognition benchmarks show that ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
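The merging step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a residual linear adapter `h + (A h + c)` placed directly after a frozen linear layer `h = W x + b`; this placement, the adapter form, and all names below (`merge_adapter`, `forward_with_adapter`, the toy weights) are assumptions made for illustration, not details stated in the abstract. Because both maps are linear, they collapse into a single layer with `W' = (I + A) W` and `b' = (I + A) b + c`.

```python
# Hypothetical sketch: fold a residual linear adapter into a frozen linear layer.
# The adapter form and all names here are illustrative assumptions.

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(a, v):
    """Multiply matrix a (n x k) by vector v (length k)."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(u, v)]

def forward_with_adapter(W, b, A, c, x):
    """Training-time path: frozen layer followed by a residual linear adapter."""
    h = add(matvec(W, x), b)              # frozen layer: h = W x + b
    return add(h, add(matvec(A, h), c))   # adapter: h + (A h + c)

def merge_adapter(W, b, A, c):
    """Inference-time reparameterization: W' = (I + A) W, b' = (I + A) b + c."""
    n = len(A)
    i_plus_a = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return matmul(i_plus_a, W), add(matvec(i_plus_a, b), c)

# Toy weights chosen only to exercise the identity.
W = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]   # frozen weight, 2 x 3
b = [0.5, -0.5]                           # frozen bias
A = [[0.1, 0.0], [0.2, -0.1]]             # adapter weight, 2 x 2
c = [0.0, 1.0]                            # adapter bias
x = [1.0, 2.0, 3.0]

y_train = forward_with_adapter(W, b, A, c, x)
W_m, b_m = merge_adapter(W, b, A, c)
y_merged = add(matvec(W_m, x), b_m)       # single layer, same shape as W
max_diff = max(abs(p - q) for p, q in zip(y_train, y_merged))
print(max_diff)
```

The merged layer has the same shape as the original one, which is why the adapter adds no parameters or computation at inference under these assumptions. A nonlinearity inside the adapter would break this identity.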

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

AIM: Adapting Image Models for Efficient Video Action Recognition

Object-centric Video Representation for Long-term Action Anticipation

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Synthesizing Videos from Images for Image-to-Video Adaptation

Adaptive Focus for Efficient Video Recognition

M-adapter: Multi-level image-to-video adaptation for video action recognition

Dynamic and Compressive Adaptation of Transformers From Images to Videos

Progressive Sparse Local Attention for Video Object Detection

Adaptive Compact Attention For Few-shot Video-to-video Translation

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Object-based (yet Class-agnostic) Video Domain Adaptation

Time-, Memory- and Parameter-Efficient Visual Adaptation

Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

FE-Adapter: Adapting Image-based Emotion Classifiers to Videos

Online Meta Adaptation for Fast Video Object Segmentation