Abstract: Compositional generalization in action recognition, i.e., Compositional Action Recognition (CAR), has recently received increasing attention. CAR challenges models to recognize unseen combinations of actions and objects, with the primary challenge being the distribution shift from training to testing. Most previous approaches for CAR incorporate supplementary object annotations (e.g., bounding boxes and object categories) to learn an instance-centric dynamic representation. However, these methods inevitably introduce stronger visual inductive biases, including object appearance and background bias, that impair generalization performance, particularly in out-of-distribution scenarios. To address this, this work attempts to construct an appearance-agnostic, de-biased representation by leveraging the powerful segmentation capability of the Segment Anything Model (SAM); this is the first exploration of SAM in the field of compositional action recognition. Specifically, we propose a novel SAM-driven Appearance-Agnostic Representation Learning (A²RL) framework for CAR, which contains two effective sub-modules: Fore-Back Mask (FBM) and Dynamic Relation Modeling (DRM). In FBM, we design a fine-grained, instance-invisible and background-removed masking strategy that weakens the strong connection between visual cues and action labels and minimizes the impact of irrelevant factors. In DRM, we explore the potential association between the subjects and objects involved in an action and then build appearance-agnostic relational descriptors for dynamic modeling. Extensive experiments demonstrate the generalization ability of the proposed framework. Notably, FBM achieves significant improvements in all three compositional settings without adding any model parameters. The proposed framework also achieves state-of-the-art performance compared with the most recent CAR methods.
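To make the masking idea concrete, the following is a minimal sketch of one possible reading of an instance-invisible, background-removed mask. It is not the authors' implementation. It assumes that per-instance binary masks (for example, produced by a segmentation model such as SAM) are already available as boolean NumPy arrays; the function names `remove_background`, `hide_instances`, and `fore_back_mask`, as well as the flat `fill_value`, are hypothetical choices made for illustration.

```python
import numpy as np


def remove_background(frame, instance_masks):
    """Zero out every pixel not covered by any instance mask (assumed reading)."""
    foreground = np.zeros(frame.shape[:2], dtype=bool)
    for mask in instance_masks:
        foreground |= mask
    out = frame.copy()
    out[~foreground] = 0
    return out


def hide_instances(frame, instance_masks, fill_value=127):
    """Replace each instance region with a flat value so its appearance is hidden."""
    out = frame.copy()
    for mask in instance_masks:
        out[mask] = fill_value
    return out


def fore_back_mask(frame, instance_masks, fill_value=127):
    """Apply both steps: background removed, instance appearance replaced."""
    no_background = remove_background(frame, instance_masks)
    return hide_instances(no_background, instance_masks, fill_value)


# Toy usage: a 4x4 RGB frame with two hypothetical instances.
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
hand = np.zeros((4, 4), dtype=bool)
hand[0:2, 0:2] = True
cup = np.zeros((4, 4), dtype=bool)
cup[2:4, 2:4] = True
masked = fore_back_mask(frame, [hand, cup])
```

In this sketch only the shape and position of each instance survive, and the operation is a pure preprocessing step with no learnable parameters, in line with the abstract's statement that FBM adds no model parameters.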

Progressive Instance-Aware Feature Learning for Compositional Action Recognition

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition

Appearance-Agnostic Representation Learning for Compositional Action Recognition

Compositional Structure Learning for Action Understanding

Look Less Think More: Rethinking Compositional Action Recognition

Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Learning and Distillating the Internal Relationship of Motion Features in Action Recognition

Motion Stimulation for Compositional Action Recognition

Learning Comprehensive Motion Representation for Action Recognition

Reassessing Hierarchical Representation for Action Recognition in Still Images

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

Online Robust Action Recognition Based on a Hierarchical Model

Compositional Zero-shot Learning Via Progressive Language-based Observations

Modelling Spatio-Temporal Interactions For Compositional Action Recognition

Learning Discriminative Features for Fast Frame-Based Action Recognition

Semi-Supervised Multiple Feature Analysis for Action Recognition

Home Action Genome: Cooperative Compositional Action Understanding

Hierarchical compositional representations for few-shot action recognition