Abstract: Compositional generalization in action recognition, i.e., Compositional Action Recognition (CAR), has recently received increasing attention. CAR challenges models to recognize unseen combinations of actions and objects, with the primary challenge being the distribution shift from training to testing. Most previous approaches to CAR incorporate supplementary object annotations (e.g., bounding boxes and object categories) to learn an instance-centric dynamic representation. However, these methods inevitably introduce a stronger visual inductive bias, including object appearance and background bias, that impairs generalization performance, particularly in out-of-distribution scenarios. To this end, this work attempts to construct an appearance-agnostic, de-biased representation by leveraging the powerful segmentation capability of the Segment Anything Model (SAM); this is the first exploration of SAM in the field of compositional action recognition. Specifically, we propose a novel SAM-driven Appearance-Agnostic Representation Learning (A²RL) framework for CAR, which contains two effective sub-modules: Fore-Back Mask (FBM) and Dynamic Relation Modeling (DRM). In FBM, we design a fine-grained, instance-invisible and background-removed masking strategy that weakens the strong connection between visual cues and action labels and minimizes the impact of irrelevant factors. In DRM, we explore the potential association between the subjects and objects involved in an action and then build appearance-agnostic relational descriptors for dynamic modeling. Extensive experiments demonstrate the generalization ability of this work. Notably, FBM achieves significant improvements in all three compositional settings without adding any model parameters. The proposed framework also achieves state-of-the-art performance in comparison with the most recent methods in CAR.
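To make the FBM idea concrete, the following is a minimal sketch of one possible reading of an "instance-invisible and background-removed" mask. It is not the authors' implementation. It assumes per-frame boolean instance masks are already available (for example from SAM); the function name `fore_back_mask`, the flat gray fill levels, and the zeroed background are all illustrative assumptions.

```python
import numpy as np

def fore_back_mask(frame, instance_masks, fill_values=None):
    """Hypothetical FBM-style masking (illustrative, not the paper's code).

    frame:          (H, W, 3) uint8 image.
    instance_masks: (N, H, W) boolean masks, one per instance (assumed given).
    Returns an image in which the background is zeroed and each instance is
    painted with a flat value, so shape and position survive but appearance
    does not.
    """
    frame = np.asarray(frame)
    masks = np.asarray(instance_masks, dtype=bool)
    if fill_values is None:
        # Assumption: one flat gray level per instance, never 0 (background).
        fill_values = np.linspace(255, 64, num=masks.shape[0]).astype(np.uint8)
    out = np.zeros_like(frame)          # background removed
    for mask, value in zip(masks, fill_values):
        out[mask] = value               # instance made "invisible" (flat fill)
    return out

# Two frames with different appearance but identical instance layout.
rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
frame_b = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True                 # instance 0: top-left block
masks[1, 2:, 2:] = True                 # instance 1: bottom-right block

masked_a = fore_back_mask(frame_a, masks)
masked_b = fore_back_mask(frame_b, masks)
```

In this sketch the masked output depends only on the instance layout, so `masked_a` and `masked_b` are identical even though the input frames differ. The transform also has no learnable parameters, which is in line with the abstract's remark that FBM adds none.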

Look Less Think More: Rethinking Compositional Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Progressive Instance-Aware Feature Learning for Compositional Action Recognition

Modelling Spatio-Temporal Interactions For Compositional Action Recognition

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition

Appearance-Agnostic Representation Learning for Compositional Action Recognition

Reassessing Hierarchical Representation for Action Recognition in Still Images

C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

Compositional Structure Learning for Action Understanding

Hierarchical compositional representations for few-shot action recognition

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Action Recognition in Still Images with Minimum Annotation Efforts

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Home Action Genome: Cooperative Compositional Action Understanding

Online Robust Action Recognition Based on a Hierarchical Model

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Compositional Zero-shot Learning Via Progressive Language-based Observations

Motion Stimulation for Compositional Action Recognition

Semantic-Disentangled Transformer With Noun-Verb Embedding for Compositional Action Recognition