Abstract: Fine-grained action recognition is a challenging task that requires identifying subtle, discriminative motion variations among fine-grained action classes. Existing methods typically focus on spatio-temporal feature extraction and long-term temporal modeling to characterize the complex spatio-temporal patterns of fine-grained actions. However, spatio-temporal features learned without explicit motion modeling may emphasize visual appearance over motion, which can compromise the learning of the motion features needed for fine-grained temporal reasoning. How to decouple robust motion representations from the spatio-temporal features, and how to leverage them to enhance the learning of discriminative features, are both crucial for fine-grained action recognition yet remain underexplored. In this paper, we propose a motion representation decoupling and concentration network (MDCNet) to address these two issues. First, we devise a motion representation decoupling (MRD) module that disentangles the spatio-temporal representation into appearance and motion features through contrastive learning over video and segment views. Next, in the proposed motion representation concentration (MRC) module, the decoupled motion representations are used to learn, for each action class, a universal motion prototype shared across all instances of that class. Finally, we project the decoupled motion features onto all the motion prototypes through semantic relations to obtain concentrated action-relevant features for each action class, which characterize the temporal distinctions between fine-grained actions and improve recognition performance. Comprehensive experiments on four widely used action recognition benchmarks, i.e., FineGym, Diving48, Kinetics400 and Something-Something, demonstrate the superiority of the proposed method over other state-of-the-art methods.
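To make the prototype projection step more concrete, the following is a minimal sketch and not the MDCNet implementation. It assumes that a class prototype can be approximated by the mean motion feature of that class, and that the "semantic relations" can be approximated by a softmax over cosine similarities. The function names `class_prototypes` and `project_onto_prototypes` and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def class_prototypes(motion_feats, labels, num_classes):
    """Assumed stand-in for learned prototypes: per-class mean of motion features."""
    dim = motion_feats.shape[1]
    protos = np.zeros((num_classes, dim))
    for c in range(num_classes):
        members = motion_feats[labels == c]
        if len(members) > 0:  # leave the prototype at zero for empty classes
            protos[c] = members.mean(axis=0)
    return protos


def project_onto_prototypes(motion_feats, protos, temperature=0.1):
    """Weight every prototype by softmax cosine similarity, then mix them.

    Returns the (N, C) relation weights and the (N, D) concentrated features.
    """
    eps = 1e-8  # guards against division by zero for all-zero vectors
    f = motion_feats / (np.linalg.norm(motion_feats, axis=1, keepdims=True) + eps)
    p = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + eps)
    sim = (f @ p.T) / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=1, keepdims=True)
    concentrated = weights @ protos
    return weights, concentrated
```

In this sketch, each row of `weights` sums to one and indicates how strongly a sample's motion feature relates to each class prototype, while `concentrated` is the corresponding mixture of prototypes. The actual model learns its prototypes and relations end to end, which this example does not attempt to reproduce.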

Pipelining Localized Semantic Features For Fine-Grained Action Recognition

A Method of Simultaneously Action Recognition and Video Segmentation of Video Streams.

Learning Semantic-Aligned Action Representation.

Part-level Action Parsing Via a Pose-guided Coarse-to-Fine Framework

Fine-Grained Action Recognition by Motion Saliency and Mid-Level Patches

Multi-stream I3D Network for Fine-grained Action Recognition

Action Recognition Using Feature Position Constrained Linear Coding

Multiple Granularity Analysis for Fine-Grained Action Detection

Progressively Parsing Interactional Objects for Fine Grained Action Detection.

Global for Coarse and Part for Fine: A Hierarchical Action Recognition Framework

Interaction Part Mining: A Mid-Level Approach For Fine-Grained Action Recognition

Fusing $\mathcal{R}$ Features and Local Features with Context-Aware Kernels for Action Recognition

Spatio-temporal Semantic Features for Human Action Recognition.

Storyboard guided Alignment for Fine-grained Video Action Recognition

Multiscale Spatial Position Coding under Locality Constraint for Action Recognition

Semantic-Augmented Local Decision Aggregation Network for Action Recognition.

Multiple Granularity Modeling: A Coarse-to-Fine Framework for Fine-grained Action Analysis.

Semi-Supervised Multiple Feature Analysis for Action Recognition

Progressive Instance-Aware Feature Learning for Compositional Action Recognition.

Fine-grained Action Recognition with Robust Motion Representation Decoupling and Concentration