Abstract: We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target more general scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relatively short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which results in performance improvement. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves state-of-the-art performance.
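
The two-stage data flow described in the abstract can be illustrated with a toy, weight-free sketch. This is not the authors' implementation: the real model uses learned graph convolutions and stacked Transformer encoder layers, whereas the hypothetical helpers below (`graph_aggregate`, `encode_sequence`, `self_attention`) only average over neighbouring joints and apply a single unparameterised attention step. The three-joint chain skeleton and the 2D coordinates are made-up illustration data.

```python
import math

def graph_aggregate(frame, adjacency):
    """One weight-free graph step: each joint becomes the mean of itself and its neighbours."""
    out = []
    for j, neighbours in enumerate(adjacency):
        group = [j] + list(neighbours)
        dim = len(frame[j])
        out.append([sum(frame[k][d] for k in group) / len(group) for d in range(dim)])
    return out

def encode_sequence(sequence, adjacency):
    """Stage 1 (intra-person): aggregate over the skeleton graph, then
    average over joints and frames to get one token per skeleton sequence."""
    dim = len(sequence[0][0])
    total = [0.0] * dim
    count = 0
    for frame in sequence:
        for joint in graph_aggregate(frame, adjacency):
            for d in range(dim):
                total[d] += joint[d]
            count += 1
    return [t / count for t in total]

def self_attention(tokens):
    """Stage 2 (inter-person): single-head scaled dot-product self-attention
    over the sequence tokens, without learned projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([sum(w / norm * k[d] for w, k in zip(weights, tokens)) for d in range(dim)])
    return out

# Hypothetical skeleton: 3 joints connected in a chain 0-1-2, 2 frames per person.
ADJACENCY = [[1], [0, 2], [1]]
person_a = [[[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]] * 2
person_b = [[[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]] * 2

tokens = [encode_sequence(seq, ADJACENCY) for seq in (person_a, person_b)]
mixed = self_attention(tokens)  # each output token now blends information from both people
```

The point of the sketch is the shape of the computation: stage 1 sees one skeleton sequence at a time, and only stage 2 lets sequences from different people exchange information.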

### What problem does this paper attempt to solve?

This paper proposes SkeleTR, a new skeleton-based action recognition framework that aims to handle multiple action recognition tasks in real-world scenarios. Unlike previous work, which focused mainly on controlled environments, SkeleTR targets more general scenarios involving a variable number of people and various forms of interaction between them.

#### Main Objectives:

1. **Handling action recognition in complex environments**: Existing methods mainly focus on simplified, controlled environments and do not carry over well to real-world scenes. SkeleTR tackles this challenge by adopting a two-stage paradigm.
2. **Reducing the impact of skeleton association errors**: Association errors accumulate over long skeleton sequences and introduce a significant amount of noise. SkeleTR mitigates this by taking shorter skeleton sequences as input and increasing the number of sequences.
3. **Modeling interactions between people**: Previous work rarely considers interactions within a group. SkeleTR uses stacked Transformer encoders to capture interactions between different individuals, which is crucial for action recognition in general scenarios.

#### Specific Applications:

- **Video-level action classification**: Identifying the action category of an entire video.
- **Instance-level action detection**: Locating the actions of each person at specific time points in a video.
- **Group activity recognition**: Recognizing the types of activities involving multiple people.

#### Technical Innovations:

- **Two-stage paradigm**: First, a Graph Convolutional Network (GCN) models each skeleton sequence, and then stacked Transformer encoders capture the relationships between sequences.
- **Flexible input format**: Sampling a large number of shorter skeleton sequences instead of a few longer ones to reduce association errors.
- **Hybrid pooling layer**: Compressing spatiotemporal features while keeping computation feasible.
- **Joint training**: Allowing transfer learning across datasets and tasks to enhance generalization on small datasets.

Through these methods, SkeleTR achieves state-of-the-art performance on various benchmarks.
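
The reasoning behind the flexible input format can be made concrete with a small sketch. It assumes a hypothetical tracker output in which each frame of a track carries the true identity of the person it actually follows, so an identity switch shows up as a change of label. The helpers `split_into_clips` and `contaminated_fraction` are illustrative names, not part of SkeleTR, and the paper's actual sampling strategy may differ.

```python
def split_into_clips(track, clip_len):
    """Cut one track into consecutive, non-overlapping clips of clip_len frames.
    Leftover frames that do not fill a whole clip are dropped."""
    return [track[s:s + clip_len] for s in range(0, len(track) - clip_len + 1, clip_len)]

def contaminated_fraction(track_ids, clip_len):
    """Fraction of clips that mix more than one true identity,
    i.e. clips that contain an association error."""
    clips = split_into_clips(track_ids, clip_len)
    mixed = sum(1 for clip in clips if len(set(clip)) > 1)
    return mixed / len(clips)

# Made-up 8-frame track: the tracker follows person A for 5 frames,
# then wrongly switches to person B for the last 3.
track_ids = ["A"] * 5 + ["B"] * 3

long_fraction = contaminated_fraction(track_ids, 8)   # one long clip, and it is mixed
short_fraction = contaminated_fraction(track_ids, 2)  # clips AA, AA, AB, BB: one of four is mixed
```

In this toy case the single identity switch corrupts the only long sequence, but just one of the four short sequences, which is the intuition for trading sequence length for sequence count.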

SkeleTR: Towards Skeleton-based Action Recognition in the Wild

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Multi-Scale Adaptive Skeleton Transformer for action recognition

Spatial Temporal Transformer Network for Skeleton-based Action Recognition

Quo Vadis, Skeleton Action Recognition?

Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning

STSD: spatial–temporal semantic decomposition transformer for skeleton-based action recognition

STST: Spatial-Temporal Specialized Transformer for Skeleton-based Action Recognition

Expressive Keypoints for Skeleton-based Action Recognition via Skeleton Transformation

Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition

One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton Matching

Learning Discriminative Trajectorylet Detector Sets for Accurate Skeleton-Based Action Recognition

A Novel Two-Stream Transformer-Based Framework for Multi-Modality Human Action Recognition

Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network

A New Representation of Skeleton Sequences for 3D Action Recognition

Action Recognition Scheme Based on Skeleton Representation with DS-LSTM Network

2D human skeleton action recognition with spatial constraints

Unveiling the Hidden Realm: Self-supervised Skeleton-based Action Recognition in Occluded Environments

Multi-Modal Transformer with Skeleton and Text for Action Recognition

MSST-RT: Multi-Stream Spatial-Temporal Relative Transformer for Skeleton-Based Action Recognition