OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning

Mushui Liu, Bozheng Li, Yunlong Yu
2024-08-12
Abstract: Recent Vision-Language Models (VLMs), e.g., CLIP, have made great progress in video recognition. Despite the improvement brought by the strong visual backbone in extracting spatial features, CLIP still falls short in capturing and integrating spatial-temporal features, which are essential for video recognition. In this paper, we propose OmniCLIP, a framework that adapts CLIP for video recognition by focusing on learning comprehensive features encompassing spatial, temporal, and dynamic spatial-temporal scales, which we refer to as omni-scale features. This is achieved through the design of spatial-temporal blocks that include parallel temporal adapters (PTA), enabling efficient temporal modeling. Additionally, we introduce a self-prompt generator (SPG) module to capture dynamic object spatial features. The synergy between PTA and SPG allows OmniCLIP to discern varying spatial information across frames and assess object scales over time. We have conducted extensive experiments in supervised video recognition, few-shot video recognition, and zero-shot recognition tasks. The results demonstrate the effectiveness of our method, especially with OmniCLIP achieving a top-1 accuracy of 74.30% on HMDB51 in a 16-shot setting, surpassing the recent MotionPrompt approach even with full training data. The code is available at https://github.com/XiaoBuL/OmniCLIP.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limitations of image-text pre-trained models such as CLIP when they are applied to video recognition. Although CLIP extracts spatial features well, it is weak at capturing and integrating the spatio-temporal features of videos. This shows up in two challenges:

1. **Dynamic object tracking**: A video model must recognize objects in each frame and also understand how actions change across frames. Because CLIP was originally designed for static image-text pairs, it struggles to track object motion and the continuity between frames.
2. **Managing video continuity**: Unlike static images, videos change over time, and so do the characteristics of the objects and scenes in them. The model needs to account for these changes, including changes in object size, appearance, and behavior.

To address these problems, the paper proposes the OmniCLIP framework. It introduces the Parallel Temporal Adapter (PTA) and Self-Prompt Generator (SPG) modules to strengthen CLIP's spatio-temporal feature learning for video recognition. Specifically:

- **Parallel Temporal Adapter (PTA)**: PTA uses self-attention to aggregate information at the same spatial location along the time dimension, which efficiently captures temporal cues in videos. PTA runs in parallel with the frozen spatial CLIP block and is merged with its spatial output through a simple learnable addition, achieving temporal adaptation while remaining computationally efficient.
- **Self-Prompt Generator (SPG)**: SPG uses average pooling and learnable projectors to extract multi-scale information. This enhances CLIP's spatial feature extraction, especially for objects that appear at different resolutions or move irregularly in videos.
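The PTA and SPG descriptions above can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`temporal_attention`, `pta_block`, `self_prompts`), the single-head attention, the scalar gate `alpha` standing in for the "learnable addition", the non-overlapping pooling windows, and the choice to use the projected pooled tokens directly as prompts are all hypothetical. The released code may differ on each point.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over frames, run separately per spatial location.

    x: (T, N, D) array with T frames, N spatial tokens, D channels.
    """
    xt = x.transpose(1, 0, 2)                       # (N, T, D): time is now the sequence axis
    q, k, v = xt @ w_q, xt @ w_k, xt @ w_v          # each (N, T, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (N, T, T)
    out = softmax(scores, axis=-1) @ v              # (N, T, D)
    return out.transpose(1, 0, 2)                   # back to (T, N, D)

def pta_block(x, frozen_spatial_fn, w_q, w_k, w_v, alpha):
    """Frozen spatial branch plus a temporal branch scaled by a learnable scalar.

    `alpha` is an assumed form of the learnable addition described in the text.
    """
    return frozen_spatial_fn(x) + alpha * temporal_attention(x, w_q, w_k, w_v)

def avg_pool_grid(grid, s):
    """Non-overlapping s x s average pooling on an (H, W, D) token grid."""
    h, w, d = grid.shape
    return grid.reshape(h // s, s, w // s, s, d).mean(axis=(1, 3))

def self_prompts(grid, projectors):
    """Build prompt tokens from one frame at several scales.

    projectors: dict mapping pooling size -> (D, D) projection matrix (assumed linear).
    Returns an (P, D) array, where P is the total number of pooled tokens.
    """
    prompts = []
    for s, w_p in projectors.items():
        pooled = avg_pool_grid(grid, s)                         # (H/s, W/s, D)
        prompts.append(pooled.reshape(-1, pooled.shape[-1]) @ w_p)
    return np.concatenate(prompts, axis=0)

# Toy usage: 4 frames, a 4x4 token grid, 8 channels.
rng = np.random.default_rng(0)
T, H, W, D = 4, 4, 4, 8
frames = rng.standard_normal((T, H * W, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
fused = pta_block(frames, lambda t: t, w_q, w_k, w_v, alpha=0.5)  # identity stands in for the frozen CLIP block
prompts = self_prompts(frames[0].reshape(H, W, D), {2: np.eye(D), 4: np.eye(D)})
print(fused.shape, prompts.shape)
```

Attending over the T frames at each spatial location keeps the attention matrix at T x T per token rather than (T·N) x (T·N) for joint space-time attention, which is one plausible reading of why the summary calls the adapter efficient.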
With these designs, OmniCLIP builds a more comprehensive representation of video content and improves performance on multiple benchmark datasets. For example, on HMDB51, OmniCLIP achieves a top-1 accuracy of 74.30% in the 16-shot setting, exceeding MotionPrompt trained on the full training data. OmniCLIP is also resource-efficient: it reaches high accuracy at low computational cost, which makes it well suited to resource-constrained environments.