Discriminative Segment Focus Network for Fine-grained Video Action Recognition

Baoli Sun,Xinchen Ye,Tiantian Yan,Zhihui Wang,Haojie Li,Zhiyong Wang

DOI: https://doi.org/10.1145/3654671

2024-03-26

Abstract: Fine-grained video action recognition aims to identify minor and discriminative variations among fine categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions to effectively characterize inter-class and intra-class variations has been neglected, which is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) to mine the discriminability of segment correlations and localize discriminative action-relevant segments for fine-grained video action recognition. Firstly, we propose a hierarchic correlation reasoning (HCR) module which explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting the correlations with other segments. Secondly, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing the consistency between the discriminability and the classification confidence of a given segment with a consistency constraint. Finally, these localized segment representations are combined with the global action representation of the whole video for boosting final recognition. Extensive experimental results on two fine-grained action recognition datasets, i.e., FineGym and Diving48, and two action recognition datasets, i.e., Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with the state-of-the-art methods.
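
The correlation-based enhancement described in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' HCR module: the function names (`pool_segments`, `enhance_segments`, `multi_scale_enhance`), the dot-product similarity, the softmax weighting, the residual addition, and the average pooling used to form coarser temporal scales are all assumptions chosen only to show the general idea of enhancing each segment with information from correlated segments at more than one scale.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pool_segments(features, window):
    """Average non-overlapping groups of `window` consecutive segments.

    A larger window gives a coarser temporal scale (assumed pooling scheme).
    """
    groups = [features[i:i + window] for i in range(0, len(features), window)]
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]


def enhance_segments(features):
    """Add to each segment a correlation-weighted mix of the other segments."""
    enhanced = []
    for i, f_i in enumerate(features):
        others = [f for j, f in enumerate(features) if j != i]
        if not others:
            # A single segment has nothing to correlate with.
            enhanced.append(list(f_i))
            continue
        # Correlation of segment i with every other segment (assumed: dot product).
        weights = softmax([dot(f_i, f_j) for f_j in others])
        context = [
            sum(w * f[d] for w, f in zip(weights, others))
            for d in range(len(f_i))
        ]
        # Residual connection: original feature plus correlated context.
        enhanced.append([x + c for x, c in zip(f_i, context)])
    return enhanced


def multi_scale_enhance(features, windows=(1, 2)):
    """Run the enhancement at several temporal scales, keyed by window size."""
    return {w: enhance_segments(pool_segments(features, w)) for w in windows}


if __name__ == "__main__":
    # Four toy segments with 2-dimensional features.
    segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    for window, feats in multi_scale_enhance(segments).items():
        print(window, feats)
```

In a real model the similarity and the update would be learned layers operating on backbone features; the sketch only fixes the data flow from segments to correlations to enhanced segments.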

computer science, information systems, theory & methods, software engineering

What problem does this paper attempt to address?

This paper addresses fine-grained video action recognition, which aims to identify subtle, discriminative differences among closely related action categories. Existing action recognition methods have made progress in modeling spatio-temporal features, but they neglect the interactions among discriminative atomic actions, which are crucial for understanding fine-grained actions. To address this, the paper proposes the Discriminative Segment Focus Network (DSFNet), which mines the discriminability of segment correlations and localizes the segments most relevant to the action. DSFNet consists of a video branch and a segment branch. The video branch extracts global semantic features, while the segment branch contains a hierarchic correlation reasoning (HCR) module and a discriminative segment focus (DSF) module. The HCR module establishes correlations between segments at multiple temporal scales and enhances each segment through graph refinement. The DSF module localizes the most discriminative segments with a consistency constraint between a segment's discriminability and its classification confidence; this acts as a self-supervised mechanism that localizes key segments without frame-level temporal annotations. The localized segment representations are then combined with the global representation of the whole video for the final prediction. Experimental results show that DSFNet compares favorably with state-of-the-art methods on FineGym, Diving48, Kinetics400, and Something-Something.
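
One way to picture the consistency constraint and the final fusion is the pure-Python sketch below. It is an illustration under stated assumptions, not the paper's formulation: the pairwise hinge form of `consistency_loss`, the `margin` parameter, the top-k selection in `top_k_segments`, and the concatenation used in `fuse` are all hypothetical choices. The only idea taken from the text is that a segment's discriminability score should agree with its classification confidence, and that the selected segments are combined with a global video feature.

```python
def consistency_loss(scores, confidences, margin=0.0):
    """Pairwise hinge penalty on ordering disagreements (assumed loss form).

    For every pair where segment i is classified more confidently than
    segment j, the discriminability score of i should not fall below that
    of j (plus an optional margin).
    """
    loss, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if confidences[i] > confidences[j]:
                loss += max(0.0, scores[j] - scores[i] + margin)
                pairs += 1
    return loss / pairs if pairs else 0.0


def top_k_segments(scores, k):
    """Indices of the k segments with the highest discriminability score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def fuse(global_feature, segment_features, indices):
    """Concatenate the global feature with the mean of the selected segments."""
    selected = [segment_features[i] for i in indices]
    mean_selected = [sum(col) / len(selected) for col in zip(*selected)]
    return list(global_feature) + mean_selected


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.5]        # hypothetical discriminability scores
    confidences = [0.8, 0.1, 0.6]   # hypothetical per-segment confidences
    print(consistency_loss(scores, confidences))

    segment_features = [[2.0, 0.0], [9.0, 9.0], [4.0, 2.0]]
    keep = top_k_segments(scores, k=2)
    print(fuse([1.0, 2.0], segment_features, keep))
```

Because the penalty depends only on the ordering of the model's own confidences, it needs no frame-level temporal labels, which matches the self-supervised character described above.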

Temporal Distinct Representation Learning for Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Dynamic Spatio-Temporal Specialization Learning for Fine-Grained Action Recognition

Diffused Fourier Network for Video Action Segmentation

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Temporal Segment Networks for Action Recognition in Videos

Dense Semantics-Assisted Networks For Video Action Recognition

SpatioTemporal Focus for Skeleton-based Action Recognition

DSTC-Net: differential spatio-temporal correlation network for similar action recognition

Multidimensional Refinement Graph Convolutional Network With Robust Decouple Loss for Fine-Grained Skeleton-Based Action Recognition

Sequential Segment Networks for Action Recognition

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Fusion detection network with discriminative enhancement for weakly-supervised temporal action localization

Actionness-pooled Deep-convolutional Descriptor for Fine-Grained Action Recognition

DIR-AS: Decoupling Individual Identification and Temporal Reasoning for Action Segmentation

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Learning Discriminative Representations for Skeleton Based Action Recognition

Gated forward refinement network for action segmentation

Multi-Dimensional Refinement Graph Convolutional Network with Robust Decouple Loss for Fine-Grained Skeleton-Based Action Recognition