Action Recognition Via Fine-Tuned CLIP Model and Temporal Transformer.

Xiaoyu Yang, Yuzhuo Fu, Ting Liu
DOI: https://doi.org/10.1007/978-3-031-50075-6_39
2024-01-01
Abstract: Contrastive image-text pre-trained models such as CLIP have been successfully transferred to the video domain, showing remarkable “zero-shot” generalization on various large-scale datasets. However, most research is based on datasets such as Kinetics and UCF-101, which focus more on appearance than on temporal order. In other words, training on these datasets may not reward good temporal understanding of videos. We want to capture the long-range dependencies between frames along the temporal dimension. In this paper, we address this problem by applying a temporal transformer module and a backbone fine-tuning strategy. Fine-tuning the backbone helps the image-based model adapt to the video setting, and the temporal transformer module captures detailed spatiotemporal information. We focus mainly on performance on the action-centered dataset Something-Something V2 because it contains a large proportion of temporal classes. We adopt language-image pre-trained models such as CLIP to further study zero-shot ability.