Spatial-temporal Graph Transformer Network for Skeleton-Based Temporal Action Segmentation
Xiaoyan Tian, Ye Jin, Zhao Zhang, Peng Liu, Xianglong Tang
DOI: https://doi.org/10.1007/s11042-023-17276-8
IF: 2.577
2023-01-01
Multimedia Tools and Applications
Abstract: Temporal action segmentation (TAS) of minute-long untrimmed videos involves locating and classifying human action segments using multiple action class labels. Previous research on this task typically generated an initial estimate with purpose-designed temporal convolutional layers and then gradually refined that estimate based solely on RGB features. This approach has several limitations, including an inability to capture inherent long-range dependencies and insufficient consideration of the intricate spatial-temporal correlations in the changing relationships between human joints. To address these limitations, we introduce a novel spatial-temporal graph transformer network (STGT) for the skeleton-based TAS task. STGT employs a series of skeleton graph transformer blocks (SGT blocks) within an encoder-decoder architecture. In particular, the spatial-temporal graph layer uses an adaptive graph strategy that makes the graph structure more flexible and robust. In addition, the spatial-temporal transformer layer in the SGT block constructs parallel attention mechanisms to model dynamic spatial correlations and non-linear temporal correlations. Integrating these components into the TAS task represents a notable advance. Experimental results on three challenging datasets (PKU-MMD, HuGaDB, and LARa) show that the proposed framework outperforms existing TAS models (MS-TCN, ASRF, BCN, ETSN, and ASFormer). Furthermore, our approach effectively reduces over-segmentation errors and ambiguous boundaries.
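To make the notion of over-segmentation concrete, the following is a minimal sketch, not taken from the paper. It assumes the segmental edit score that is commonly reported in TAS evaluations; the abstract does not state which metrics were used. The function names (`to_segments`, `levenshtein`, `edit_score`, `frame_accuracy`) and the toy labels are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def to_segments(frame_labels):
    """Collapse per-frame labels into the ordered list of segment labels."""
    return [label for label, _ in groupby(frame_labels)]


def levenshtein(a, b):
    """Edit distance between two label sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_score(pred_frames, true_frames):
    """Segmental edit score in [0, 100]; penalizes spurious extra segments."""
    pred, true = to_segments(pred_frames), to_segments(true_frames)
    longest = max(len(pred), len(true))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(pred, true) / longest)


def frame_accuracy(pred_frames, true_frames):
    """Percentage of frames whose predicted label matches the ground truth."""
    hits = sum(p == t for p, t in zip(pred_frames, true_frames))
    return 100.0 * hits / len(true_frames)


# Toy example (hypothetical labels): one mislabeled frame splits "walk" in two.
true_frames = ["walk"] * 5 + ["sit"] * 5
pred_frames = ["walk", "walk", "sit", "walk", "walk"] + ["sit"] * 5

print(to_segments(pred_frames))               # ['walk', 'sit', 'walk', 'sit']
print(frame_accuracy(pred_frames, true_frames))  # 90.0
print(edit_score(pred_frames, true_frames))      # 50.0
```

In this toy case a single wrong frame leaves frame-wise accuracy at 90% but halves the segmental score, which is why over-segmentation is usually assessed with segment-level rather than frame-level measures.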