Discriminative Segment Focus Network for Fine-grained Video Action Recognition

Baoli Sun, Xinchen Ye, Tiantian Yan, Zhihui Wang, Haojie Li, Zhiyong Wang
DOI: https://doi.org/10.1145/3654671
2024-03-26
Abstract: Fine-grained video action recognition aims to identify minor and discriminative variations among fine categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions to effectively characterize inter-class and intra-class variations has been neglected, which is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) to mine the discriminability of segment correlations and localize discriminative action-relevant segments for fine-grained video action recognition. Firstly, we propose a hierarchic correlation reasoning (HCR) module which explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting the correlations with other segments. Secondly, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing the consistency between the discriminability and the classification confidence of a given segment with a consistency constraint. Finally, these localized segment representations are combined with the global action representation of the whole video for boosting final recognition. Extensive experimental results on two fine-grained action recognition datasets, i.e., FineGym and Diving48, and two action recognition datasets, i.e., Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with the state-of-the-art methods.
computer science, information systems, theory & methods, software engineering
What problem does this paper attempt to address?
This paper addresses fine-grained video action recognition, which aims to identify subtle, discriminative differences among closely related action categories. Existing action recognition methods have made progress in modeling spatio-temporal features, but they neglect the interactions among discriminative atomic actions, which are crucial for understanding fine-grained actions. To address this, the paper proposes the Discriminative Segment Focus Network (DSFNet), which improves fine-grained recognition by mining the discriminability of segment correlations and localizing discriminative, action-relevant segments. DSFNet consists of a video branch and a segment branch. The video branch extracts global semantic features, while the segment branch comprises a hierarchic correlation reasoning (HCR) module and a discriminative segment focus (DSF) module. The HCR module establishes correlations between segments at multiple temporal scales and enhances each segment through graph refinement. The DSF module localizes the most discriminative segments with a consistency constraint that aligns each segment's discriminability with its classification confidence; this acts as a self-supervised mechanism for localizing key segments without frame-level temporal annotations. The localized segment representations are then combined with the global representation of the whole video for the final prediction. Experimental results show that DSFNet performs favorably against state-of-the-art methods on FineGym, Diving48, Kinetics400, and Something-Something, supporting its effectiveness for fine-grained action recognition.
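The two segment-branch ideas described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the dot-product correlation with a softmax, the sigmoid scoring head, the mean-squared consistency term, and all shapes and weights are assumptions chosen only to make the ideas concrete. It pools frame features into segments at several temporal scales, enhances each segment with a correlation-weighted sum of the others, and measures how far a per-segment discriminability score is from the classifier's confidence in the true class.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_segments(frames, num_segments):
    # Average-pool per-frame features (T, D) into (num_segments, D).
    chunks = np.array_split(frames, num_segments, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

def enhance_by_correlation(segments):
    # Assumed form of correlation reasoning: scaled dot-product
    # correlations between segments, added back as a residual.
    d = segments.shape[1]
    corr = softmax(segments @ segments.T / np.sqrt(d), axis=-1)  # (S, S)
    return segments + corr @ segments

def consistency_loss(discriminability, confidence):
    # Assumed form of the consistency constraint: mean squared gap
    # between segment discriminability and classification confidence.
    return float(np.mean((discriminability - confidence) ** 2))

# Toy inputs; every value here is hypothetical.
rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))       # 16 frames, 8-dim features
num_classes, label = 5, 2
W_cls = rng.normal(size=(8, num_classes))  # hypothetical classifier weights
w_disc = rng.normal(size=8)                # hypothetical scoring head

global_feat = frames.mean(axis=0)       # stand-in for the video branch

for scale in (2, 4, 8):                 # multiple temporal scales
    segs = enhance_by_correlation(pool_segments(frames, scale))
    conf = softmax(segs @ W_cls)[:, label]          # confidence in true class
    disc = 1.0 / (1.0 + np.exp(-(segs @ w_disc)))   # discriminability in (0, 1)
    loss = consistency_loss(disc, conf)
    top = int(np.argmax(disc))                      # most discriminative segment
    fused = np.concatenate([global_feat, segs[top]])  # segment + global feature
    print(f"scale={scale} loss={loss:.4f} top_segment={top} fused_dim={fused.shape[0]}")
```

In a trained model the consistency term would be minimized jointly with the classification loss, so that segments scored as discriminative are also the ones the classifier is confident about; here the weights are random and the loop only demonstrates the data flow.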