Category-Specific Prompts for Animal Action Recognition with Pretrained Vision-Language Models

Yinuo Jing, Chunyu Wang, Ruxu Zhang, Kongming Liang, Zhanyu Ma
DOI: https://doi.org/10.1145/3581783.3612551
2023-01-01
Abstract: Animal action recognition has a wide range of applications. However, the field remains largely unexplored because it poses greater challenges than human action recognition, such as a lack of annotated training data, large intra-class variation, and interference from cluttered backgrounds. Most existing methods directly apply human action recognition techniques, which require a large amount of annotated data. In recent years, contrastive vision-language pretraining has demonstrated strong zero-shot generalization ability and has been used for human action recognition. Inspired by this success, we develop a highly performant action recognition framework based on the CLIP model. Our model addresses the above challenges via a novel category-specific prompting module that generates adaptive prompts for both text and video based on the animal category detected in the input video. On one hand, it generates more precise and customized textual descriptions for each pair of action and animal category, which helps align the textual and visual spaces. On the other hand, it allows the model to focus on the video features of the target animal and reduces interference from background noise. Experimental results demonstrate that our method outperforms five previous action recognition methods on the Animal Kingdom dataset and shows the best generalization ability on unseen animals.
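The text side of the idea in the abstract, one description per pair of action and animal category scored against a video feature, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the fixed template string, the function names `build_prompts` and `rank_actions`, and the toy feature vectors are hypothetical. The paper generates its prompts with a learned module and uses CLIP encoders, neither of which is reproduced here.

```python
import math


def build_prompts(animal, actions, template="a video of a {animal} {action}"):
    # Hypothetical fixed template: one text description per (animal, action) pair.
    # The paper's module produces adaptive prompts; this only shows the pairing.
    return [template.format(animal=animal, action=action) for action in actions]


def cosine(u, v):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_actions(video_feature, text_features, actions):
    # Score each action by the similarity between the video feature and the
    # feature of that action's prompt, then sort from best to worst match.
    scores = [cosine(video_feature, t) for t in text_features]
    return sorted(zip(actions, scores), key=lambda pair: pair[1], reverse=True)


actions = ["eating", "flying"]
prompts = build_prompts("heron", actions)

# Toy stand-ins for encoder outputs (assumed values, not real CLIP features).
text_features = [[0.0, 1.0], [1.0, 0.0]]
video_feature = [0.9, 0.1]
ranking = rank_actions(video_feature, text_features, actions)
```

In a real pipeline the toy vectors would be replaced by the outputs of a pretrained text encoder and video encoder, and the animal name would come from a category detector rather than being passed in by hand.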