Appearance-Agnostic Representation Learning for Compositional Action Recognition
Peng Huang, Xiangbo Shu, Rui Yan, Zhewei Tu, Jinhui Tang
DOI: https://doi.org/10.1109/tcsvt.2024.3384392
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Compositional generalization in action recognition, i.e., Compositional Action Recognition (CAR), has recently received increasing attention. CAR requires models to recognize unseen combinations of actions and objects, and its primary challenge is the distribution shift between training and testing. Most previous approaches to CAR incorporate supplementary object annotations (e.g., bounding boxes and object categories) to learn an instance-centric dynamic representation. However, these methods inevitably introduce stronger visual inductive biases, including object appearance bias and background bias, that degrade generalization performance, particularly in out-of-distribution scenarios. To this end, this work attempts to construct an appearance-agnostic, de-biased representation by leveraging the powerful segmentation capability of the Segment Anything Model (SAM), which is the first exploration of SAM in the field of compositional action recognition. Specifically, we propose a novel SAM-driven Appearance-Agnostic Representation Learning (A²RL) framework for CAR, which contains two effective sub-modules: Fore-Back Mask (FBM) and Dynamic Relation Modeling (DRM). In FBM, we design a fine-grained, instance-invisible and background-removed masking strategy that weakens the strong connection between visual cues and action labels and minimizes the impact of irrelevant factors. In DRM, we explore the potential association between the subjects and objects involved in an action and then build appearance-agnostic relational descriptors for dynamic modeling. Extensive experiments demonstrate the generalization ability of the proposed framework. Notably, FBM achieves significant improvements in all three compositional settings without adding any model parameters. The proposed framework also achieves state-of-the-art performance in comparison with the most recent methods in CAR.
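The abstract does not give implementation details for FBM, so the following is only a minimal sketch of one way an instance-invisible, background-removed mask could be applied to a single frame. It assumes that per-instance binary masks are already available (for example, from a segmenter such as SAM). The function name `fore_back_mask`, the `fill_values` parameter, and the choice of flat gray fills are hypothetical illustrations, not the authors' method.

```python
import numpy as np

def fore_back_mask(frame, instance_masks, fill_values=None):
    """Illustrative sketch only, not the paper's FBM implementation.

    frame:          (H, W, 3) uint8 image.
    instance_masks: (N, H, W) boolean masks, one per instance
                    (assumed to come from an external segmenter).
    fill_values:    optional N gray levels; hypothetical parameter.
    """
    frame = np.asarray(frame)
    masks = np.asarray(instance_masks, dtype=bool)
    if fill_values is None:
        # Assumption: give each instance a distinct flat gray level.
        fill_values = np.linspace(64, 255, num=len(masks)).astype(np.uint8)

    # Background removed: every pixel starts as black.
    out = np.zeros_like(frame)
    for mask, value in zip(masks, fill_values):
        # Instance-invisible: texture and color are replaced by a flat fill,
        # so only the shape and position of the instance remain.
        out[mask] = value
    return out

# Usage with a toy 4x4 frame and two square "instances".
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, 0:2, 0:2] = True
masks[1, 2:4, 2:4] = True
masked = fore_back_mask(frame, masks)
```

In this sketch the masking is a pure preprocessing step with no learnable weights, which is consistent with the abstract's statement that FBM adds no model parameters.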