GBC: Guided Alignment and Adaptive Boosting CLIP Bridging Vision and Language for Robust Action Recognition

Zhaoqilin Yang, Gaoyun An, Zhenxing Zheng, Shan Cao, Qiuqi Ruan
DOI: https://doi.org/10.1109/tcsvt.2024.3390133
2024-01-01
Abstract: The Contrastive Language-Image Pre-training (CLIP) model achieves strong generalization by using a large number of text-image pairs for contrastive learning. However, when it is transferred to action recognition, the following two questions remain to be solved: 1) How to guide the model to focus more on human-body-related regions to better align actions and text, and 2) How to make the model strengthen itself in a targeted manner to deal with difficult-to-classify categories. To solve these problems, a Guided alignment and adaptive Boosting CLIP (GBC) is proposed, which employs visual prior knowledge and benefits from both feature and decision aggregation in a boosting manner. During early training, visual prior knowledge related to the human body is adopted, which enables the model to better align human actions with category text and makes it more robust to distribution shift. At the later stage of training, the CLIP encoder is frozen, and multiple downstream feature and decision aggregation modules are sequentially generated and trained. In this way, the model is able to boost performance from different perspectives in a boosting manner, at a linearly increasing cost. Moreover, a class-adaptive re-weighting strategy is proposed to make the model focus more on optimizing categories that are difficult to classify. The effectiveness of our model is validated on six action recognition datasets (Kinetics-600, Kinetics-400, Jester, HMDB-51, UCF-101, and Mini-Kinetics-200), including both fully supervised and zero-shot experiments. Our model achieves superior results compared to state-of-the-art methods on all datasets.
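The abstract does not give the formula behind the class-adaptive re-weighting strategy, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that class difficulty is measured by per-class accuracy and that each class weight is proportional to (1 - accuracy)^gamma, normalized to a mean of 1. The function names, the `gamma` parameter, and the example counts are all hypothetical.

```python
import math

def class_adaptive_weights(correct, total, gamma=1.0):
    """Hypothetical rule: weight each class by (1 - accuracy)^gamma.

    Weights are normalized so that their mean is 1, which keeps the
    overall loss scale comparable to the unweighted case.
    """
    raw = []
    for c, t in zip(correct, total):
        acc = c / t if t > 0 else 0.0
        # Small epsilon keeps perfectly classified classes from getting weight 0.
        raw.append((1.0 - acc) ** gamma + 1e-8)
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

def weighted_cross_entropy(probs, labels, weights):
    """Mean over samples of -weights[y] * log(probs[y])."""
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

# Illustrative counts only: class 2 is the hardest (10% accuracy).
weights = class_adaptive_weights(correct=[90, 50, 10], total=[100, 100, 100])

# Two samples with predicted class probabilities and true labels.
probs = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
labels = [0, 2]
loss = weighted_cross_entropy(probs, labels, weights)
```

Under these assumptions, a mistake on the hard class contributes more to the loss than an equally confident mistake on an easy class, which is one simple way to make training focus on difficult-to-classify categories.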