Are Visual-Language Models Effective in Action Recognition? A Comparative Study

Mahmoud Ali, Di Yang, François Brémond
2024-10-23
Abstract: Current vision-language foundation models, such as CLIP, have recently shown significant improvement in performance across various downstream tasks. However, whether such foundation models significantly improve more complex fine-grained action recognition tasks is still an open question. To answer this question and better find out the future research direction on human behavior analysis in-the-wild, this paper provides a large-scale study and insight on current state-of-the-art vision foundation models by comparing their transfer ability onto zero-shot and frame-wise action recognition tasks. Extensive experiments are conducted on recent fine-grained, human-centric action recognition datasets (e.g., Toyota Smarthome, Penn Action, UAV-Human, TSU, Charades) including action classification and segmentation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper examines how effective, and how limited, visual-language models are on complex, fine-grained action recognition tasks. Specifically, the authors focus on the following aspects:

1. **Zero-shot action classification**: Zero-shot action classification means classifying actions on an unseen dataset without retraining the model. Unlike traditional methods, zero-shot methods aim to generalize from known actions to unknown ones. However, visual features (such as shape, color, and motion) are usually low-level, while action descriptions are more abstract, so it is difficult for a model to match the two accurately.
2. **Frame-level action segmentation**: Frame-level action segmentation classifies the activity in each frame of an untrimmed video, which amounts to dividing the video into segments that each correspond to one coherent action. The main challenge is modeling the long-term relationships between different time steps.
3. **Evaluating the generalization ability of existing models**: The authors select recent multi-modal video-based models (such as CLIP, X-CLIP, ViCLIP, and ViFi-CLIP) and evaluate them on multiple real-world datasets. These datasets include Toyota Smarthome, UAV-Human, Penn Action, and NTU RGB+D, and cover human behavior analysis in daily life.
4. **Exploring improvement directions**: The paper points out that current state-of-the-art visual-language models still struggle with complex actions and long-term temporal consistency. To improve them, the authors suggest supplementing visual information with additional modalities (such as audio and geometric information) and designing more effective temporal modeling methods that support long-term temporal reasoning. In addition, using large language models (LLMs) to enrich action descriptions can also improve zero-shot classification.

### Main contributions

1. **Large-scale study**: Evaluate the transfer learning ability of current visual-language models on real-world action recognition tasks.
2. **Strategy comparison**: Provide insights into, and comparisons of, different action description generation strategies (for zero-shot action classification) and different frame-level action prediction strategies (using video question-answering models for zero-shot action segmentation).
3. **Experimental verification**: Conduct extensive experiments on multiple challenging benchmark datasets to verify the performance of the different models.

Through these studies, the authors hope to provide directions for future research, especially in multi-modal data integration and fine-tuning strategies, to improve action recognition performance.
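To make the two tasks above concrete, here is a minimal sketch of zero-shot classification by embedding similarity, followed by merging per-frame labels into segments. It is not the authors' pipeline: the summary states that zero-shot segmentation in the paper relies on video question-answering models, whereas this sketch uses a simpler nearest-description rule per frame. The embeddings are hand-made toy vectors standing in for the outputs of a CLIP-style encoder, and the names `cosine`, `zero_shot_classify`, and `frames_to_segments` are hypothetical helpers introduced only for illustration.

```python
import math
from itertools import groupby


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_classify(visual_embedding, text_embeddings):
    """Return the action description whose embedding is most similar
    to the visual embedding. No training on the target classes is involved."""
    return max(
        text_embeddings,
        key=lambda label: cosine(visual_embedding, text_embeddings[label]),
    )


def frames_to_segments(frame_labels):
    """Merge runs of identical per-frame labels into
    (label, start_frame, end_frame) tuples, with end_frame exclusive."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Toy vectors (assumption): in practice these would come from the text
# and visual encoders of a visual-language model.
text_embeddings = {
    "drinking from a cup": [0.9, 0.1, 0.0],
    "reading a book": [0.1, 0.9, 0.1],
    "walking": [0.0, 0.2, 0.9],
}
frame_embeddings = [
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.1, 1.0],
    [0.1, 0.3, 0.9],
]

# Classify each frame independently, then group consecutive frames.
frame_labels = [zero_shot_classify(f, text_embeddings) for f in frame_embeddings]
segments = frames_to_segments(frame_labels)
print(segments)
```

Because each frame is labeled independently, this sketch has no temporal modeling at all, which is exactly the weakness the summary attributes to current models on long untrimmed videos.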