Pose-aware video action segmentation

Meijing Zhang, Chenyang Liao, Qi Li, Hua Zhang, Wenxi Liu
DOI: https://doi.org/10.1007/s00521-024-09920-7
2024-06-07
Neural Computing and Applications
Abstract: Action segmentation is an emerging task in video understanding, particularly for untrimmed videos containing multiple actions. However, existing video-based methods may struggle due to their sensitivity to visual factors, while skeleton-based methods may not capture sufficient information from human poses to segment actions accurately. To overcome these limitations, we propose a novel approach that jointly leverages the complementary information in video and human poses for action segmentation. To the best of our knowledge, this is the first attempt to exploit the complementarity of video and poses for this task. Specifically, we introduce a cross-modal salient sampling module that uses attention to integrate human pose information with temporal visual features across modalities. Our approach achieves state-of-the-art performance on two benchmarks, demonstrating the efficacy of combining visual and pose information for action segmentation.
computer science, artificial intelligence