LAGA-Net: Local-and-Global Attention Network for Skeleton Based Action Recognition

Rongjie Xia, Yanshan Li, Wenhan Luo
DOI: https://doi.org/10.1109/tmm.2021.3086758
IF: 7.3
2021-01-01
IEEE Transactions on Multimedia
Abstract: Skeleton-based action recognition has attracted significant attention and found widespread application due to the robustness of 3D skeleton data. A key challenge is how to extract discriminative and robust spatio-temporal features from sparse skeleton data to describe actions and improve recognition accuracy. To address this issue, this paper combines convolutions with attention mechanisms and proposes a deep network for skeleton-based action recognition, termed the local-and-global attention network (LAGA-Net). First, we encode skeleton sequences into joint feature evolution maps to compactly describe the spatial and temporal characteristics of skeleton sequences. Then, a motion-guided channel attention module (MGCAM) is proposed to model the interdependencies between feature channels by calculating temporal frame-level motion and to enhance motion-salient features channel-wise. Further, a spatio-temporal attention module (STAM) is proposed to model spatio-temporal context-aware collaboration at the sequence level and to extract spatio-temporal attention features that capture long-range dependencies. MGCAM and STAM are combined to form LAGA-Net, which extracts discriminative features integrating both local and global representations of skeleton sequences. Moreover, a two-stream architecture is proposed to learn complementary features from joint and bone aspects. We conduct extensive experiments to verify the effectiveness and superiority of our proposed method over state-of-the-art approaches on several benchmarks (e.g., NTU RGB+D, Northwestern-UCLA, UTD-MHAD and NTU RGB+D 120).
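To make the idea of motion-guided channel attention more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact formulation of MGCAM, so everything here is an assumption: the feature layout `features[c][t][j]` (channel, frame, joint), the use of absolute frame-to-frame differences as "frame-level motion", and a plain sigmoid standing in for whatever learned layers the actual module uses. The function names are hypothetical.

```python
import math


def motion_guided_channel_weights(features):
    """Return one attention weight per channel.

    features[c][t][j] is the value for channel c, frame t, joint j
    (an assumed layout; the paper's tensor layout may differ).
    """
    weights = []
    for channel in features:
        # Frame-level motion: absolute difference between consecutive frames.
        diffs = [
            abs(channel[t + 1][j] - channel[t][j])
            for t in range(len(channel) - 1)
            for j in range(len(channel[t]))
        ]
        mean_motion = sum(diffs) / len(diffs) if diffs else 0.0
        # Sigmoid maps the motion statistic into (0, 1).
        # Assumption: a stand-in for the learned layers of a real module.
        weights.append(1.0 / (1.0 + math.exp(-mean_motion)))
    return weights


def apply_channel_attention(features):
    """Rescale every channel by its motion-derived weight."""
    weights = motion_guided_channel_weights(features)
    return [
        [[w * value for value in frame] for frame in channel]
        for channel, w in zip(features, weights)
    ]


# Toy input: channel 0 is static, channel 1 changes between frames.
toy = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
toy_weights = motion_guided_channel_weights(toy)
toy_out = apply_channel_attention(toy)
```

In this sketch a channel with no motion receives the sigmoid midpoint weight of 0.5, and channels with larger frame-to-frame change receive weights closer to 1, so motion-salient channels are emphasized relative to static ones.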