Abstract: For RGB-based temporal action segmentation (TAS), methods that capture frame-level features have achieved remarkable performance. Motion-centered TAS, however, remains challenging for existing methods because they neglect the spatial features of joints. In addition, frames with similar motion produce inaccurate action boundaries, which break the integrity of the action segments. To alleviate these issues, we propose an end-to-end network, Involving Distinguished Temporal Graph Convolutional Networks (IDT-GCN). First, we construct an enhanced spatial graph structure that adaptively captures both the similar and the differential dependencies between joints in a single topology by learning two independent correlation modeling functions. Then, the proposed Involving Distinguished Graph Convolutional (ID-GC) models the spatial correlations of different actions in a video by applying multiple enhanced topologies to the corresponding channels. Furthermore, we design a generic temporal action regression network, termed Temporal Segment Regression (TSR), which models action sequences to extract segmented encoding features and action boundary representations. Combining these components with label smoothing modules, we develop a powerful spatial-temporal graph convolutional network (IDT-GCN) for fine-grained TAS, which notably outperforms state-of-the-art methods on the MCFS-22 and MCFS-130 datasets. Adding TSR to TCN-based baseline methods achieves performance competitive with state-of-the-art transformer-based methods on the RGB-based datasets Breakfast and 50Salads. Further experimental results on the action recognition task verify the superiority of the enhanced spatial graph structure over previous graph convolutional networks.
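The idea of merging similar and differential joint dependencies into one topology can be illustrated with a minimal sketch. The abstract does not specify the two correlation modeling functions, so everything below is an assumption made for illustration: the function names, the tanh-of-dot-product similarity, the tanh-of-mean-difference term, the fixed weights `alpha` and `beta` (which a real model would presumably learn), and the toy three-joint skeleton. It is not the authors' ID-GC implementation.

```python
import math

def similarity_corr(x):
    """Hypothetical 'similar' dependency: tanh of the dot product of joint features."""
    return [[math.tanh(sum(a * b for a, b in zip(xi, xj))) for xj in x] for xi in x]

def difference_corr(x):
    """Hypothetical 'differential' dependency: tanh of the mean feature difference."""
    c = len(x[0])
    return [[math.tanh(sum(a - b for a, b in zip(xi, xj)) / c) for xj in x] for xi in x]

def enhanced_topology(x, base_adj, alpha=0.5, beta=0.5):
    """Combine the skeleton adjacency with both correlation terms in one matrix.

    alpha and beta are fixed here for illustration only.
    """
    sim, diff = similarity_corr(x), difference_corr(x)
    v = len(x)
    return [[base_adj[i][j] + alpha * sim[i][j] + beta * diff[i][j]
             for j in range(v)] for i in range(v)]

def aggregate(x, adj):
    """One graph-convolution-style aggregation step: out_i = sum_j adj[i][j] * x_j."""
    v, c = len(x), len(x[0])
    return [[sum(adj[i][j] * x[j][k] for j in range(v)) for k in range(c)]
            for i in range(v)]

# Toy input: 3 joints with 2 feature channels each, connected as a chain.
joints = [[1.0, 0.0], [0.5, 1.5], [0.0, 3.0]]
chain = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]

topology = enhanced_topology(joints, chain)
features = aggregate(joints, topology)
```

In this sketch the similarity term is symmetric while the difference term is antisymmetric, so the combined topology is directed: joint i can weight joint j differently than j weights i. A single fixed skeleton adjacency cannot express that.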

Stacking-Based Attention Temporal Convolutional Network for Action Segmentation

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Involving Distinguished Temporal Graph Convolutional Networks for Skeleton-Based Temporal Action Segmentation

SG-TCN: Semantic Guidance Temporal Convolutional Network for Action Segmentation

Spatio-Temporal Attention Networks for Action Recognition and Detection

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action Segmentation

Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

Joint Network based Attention for Action Recognition

Attentional Fused Temporal Transformation Network for Video Action Recognition

STCA: an action recognition network with spatio-temporal convolution and attention

Deep Concept-wise Temporal Convolutional Networks for Action Localization

Temporal Segment Transformer for Action Segmentation

Temporal Segment Networks for Action Recognition in Videos

Temporal Attentive Network for Action Recognition

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition

Spatial Temporal Graph Attention Network for Skeleton-Based Action Recognition

A motion-aware and temporal-enhanced Spatial–Temporal Graph Convolutional Network for skeleton-based human action segmentation