Abstract: Fine-grained video action recognition aims to identify subtle and discriminative variations among fine-grained action categories. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions so as to effectively characterize inter-class and intra-class variations has been neglected, even though it is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) that mines the discriminability of segment correlations and localizes discriminative, action-relevant segments for fine-grained video action recognition. First, we propose a hierarchic correlation reasoning (HCR) module, which explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting its correlations with the other segments. Second, we devise a discriminative segment focus (DSF) module that localizes the most action-relevant segments from the enhanced representations produced by HCR, using a consistency constraint that enforces agreement between the discriminability of a segment and its classification confidence. Finally, these localized segment representations are combined with the global action representation of the whole video to boost the final recognition. Extensive experiments on two fine-grained action recognition datasets, i.e., FineGym and Diving48, and two general action recognition datasets, i.e., Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with state-of-the-art methods.
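The abstract gives no equations, so the following is only a rough, hypothetical sketch in plain Python of how the three stages (multi-scale segment correlation, discriminative segment selection with a consistency term, and global/local fusion) might fit together. Every function name (`hcr_enhance`, `dsf_select`, `fuse`), the sliding-window averaging used for the temporal scales, the dot-product softmax weighting, the sigmoid discriminability head, the squared-error form of the consistency constraint, and the averaging fusion are our assumptions for illustration, not the DSFNet formulation.

```python
import math
from typing import List, Sequence, Tuple

Vec = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vs: Sequence[Sequence[float]]) -> Vec:
    # Element-wise average of a non-empty list of equal-length vectors.
    return [sum(col) / len(vs) for col in zip(*vs)]


def softmax(xs: Sequence[float]) -> Vec:
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def hcr_enhance(segs: List[Vec], scales: Tuple[int, ...] = (1, 2)) -> List[Vec]:
    """Hypothetical HCR-style step. At each temporal scale s, segments are
    averaged over sliding windows of s consecutive segments; every segment
    attends to those window units with dot-product softmax weights, and the
    context averaged over scales is added back to the segment as a residual."""
    enhanced = []
    for x in segs:
        contexts = []
        for s in scales:
            units = [mean(segs[i:i + s]) for i in range(len(segs) - s + 1)]
            weights = softmax([dot(x, u) for u in units])
            contexts.append([sum(w * u[d] for w, u in zip(weights, units))
                             for d in range(len(x))])
        ctx = mean(contexts)
        enhanced.append([a + c for a, c in zip(x, ctx)])
    return enhanced


def dsf_select(segs: List[Vec], w_disc: Vec, w_cls: List[Vec],
               label: int, k: int) -> Tuple[List[int], float]:
    """Hypothetical DSF-style step. A sigmoid head scores each segment's
    discriminability, a linear classifier gives its confidence for the
    ground-truth label, and a squared-error term (our assumed form of the
    consistency constraint, which training would add to the classification
    loss) penalizes disagreement between the two. The k segments with the
    highest discriminability are kept, returned in temporal order."""
    disc = [sigmoid(dot(w_disc, x)) for x in segs]
    conf = [softmax([dot(row, x) for row in w_cls])[label] for x in segs]
    consistency = sum((d - c) ** 2 for d, c in zip(disc, conf)) / len(segs)
    top = sorted(range(len(segs)), key=lambda t: disc[t], reverse=True)[:k]
    return sorted(top), consistency


def fuse(segs: List[Vec], selected: List[int]) -> Vec:
    """Hypothetical fusion: average the global (all-segment) representation
    with the mean of the selected segments."""
    global_rep = mean(segs)
    local_rep = mean([segs[t] for t in selected])
    return [(g + l) / 2 for g, l in zip(global_rep, local_rep)]


# Toy run: 4 segments with 2-D features (made-up numbers, not DSFNet data).
segs = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.5]]
enhanced = hcr_enhance(segs)
selected, loss = dsf_select(enhanced, w_disc=[1.0, 0.0],
                            w_cls=[[1.0, 0.0], [0.0, 1.0]], label=0, k=2)
video_rep = fuse(enhanced, selected)
print(selected, round(loss, 4), [round(v, 3) for v in video_rep])
```

In this toy run the discriminability head only looks at the first feature, so the two segments with the strongest first feature (indices 0 and 2) are the ones kept. A real model would learn these weights and operate on deep segment features rather than hand-picked vectors.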

Action Recognition Through Discovering Distinctive Action Parts

Discovering Distinctive Action Parts for Action Recognition

Mid-Level Parts Mined by Feature Selection for Action Recognition

Discriminative Part Selection for Human Action Recognition

Action Recognition by Mid-Level Discriminative Spatial-Temporal Volume

Group Sparse-Based Mid-Level Representation for Action Recognition

Action Recognition by Exploring Data Distribution and Feature Correlation

Action Recognition by Hierarchical Mid-level Action Elements

Semi-Supervised Multiple Feature Analysis for Action Recognition

Action Recognition Using a Hierarchy of Feature Groups

Action Recognition Based on a Selective Sampling Strategy for Real-Time Video Surveillance

Action Recognition Using Hybrid Feature Descriptor and VLAD Video Encoding

Fine-Grained Action Recognition by Motion Saliency and Mid-Level Patches

Discriminative Segment Focus Network for Fine-grained Video Action Recognition

Human Action Recognition Based on Extracted Discriminative Regions

Action Recognition with Actons

Action Recognition Using Nonnegative Action Component Representation and Sparse Basis Selection

Semantic Parts Based Top-Down Pyramid for Action Recognition

Action Recognition Using Form and Motion Modalities

Action Recognition by Saliency-Based Dense Sampling

Action Recognition Based on Depth Image Sequence