SkeleTR: Towards Skeleton-based Action Recognition in the Wild

Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo
2023-09-21
Abstract: We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target more general scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relatively short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which results in performance improvement. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves state-of-the-art performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper proposes SkeleTR, a new skeleton-based action recognition framework that aims to address multi-task action recognition in real-world scenarios. Unlike previous work, which focused mainly on controlled environments, SkeleTR targets more complex real-world scenarios involving a variable number of people and various forms of interaction between them.

#### Main Objectives:

1. **Handling action recognition in complex environments**: Existing methods mainly focus on simplified and controlled environments, but they perform poorly in the real world. SkeleTR tackles this challenge by adopting a two-stage paradigm.
2. **Reducing the impact of skeleton association errors**: Skeleton association errors in long sequences can introduce a significant amount of noise. SkeleTR mitigates this by taking shorter skeleton sequences as input and increasing the number of sequences.
3. **Modeling interactions between people**: Previous work rarely addresses group interactions. SkeleTR uses stacked Transformer encoders to capture interactions between different individuals, which is crucial for action recognition in general scenarios.

#### Specific Applications:

- **Video-level action classification**: Identifying the action category of an entire video.
- **Instance-level action detection**: Locating the actions of each person at specific time points in a video.
- **Group activity recognition**: Recognizing the type of activity that multiple people perform together.

#### Technical Innovations:

- **Two-stage paradigm**: First, a Graph Convolutional Network (GCN) models each skeleton sequence; then stacked Transformer encoders capture the relationships between sequences.
- **Flexible input format**: Sampling a large number of shorter skeleton sequences instead of a few longer ones to reduce the effect of association errors.
- **Hybrid pooling layer**: Compressing spatiotemporal features while maintaining computational feasibility.
- **Joint training**: Allowing transfer learning across datasets and tasks to enhance generalization on small datasets.

Through these methods, SkeleTR achieves state-of-the-art performance on various benchmarks.
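The flexible input format described above (many short skeleton sequences rather than a few long ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `split_into_short_clips`, the dictionary-based track layout, the non-overlapping windows, the rule that a person must be visible in every frame of a window, and the 17-joint poses in the usage example are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout (an assumption, not taken from the paper):
# a pose is a list of (x, y) joint coordinates, and a track maps a
# frame index to the pose that the tracker assigned to one person ID.
Pose = List[Tuple[float, float]]
Track = Dict[int, Pose]


def split_into_short_clips(
    tracks: Dict[int, Track], video_len: int, clip_len: int
) -> List[dict]:
    """Cut every person track into non-overlapping windows of clip_len frames.

    A window is kept only if the person is present in all of its frames
    (an illustrative rule). Each kept window becomes one short skeleton
    sequence, so a long video yields many short sequences.
    """
    clips = []
    # Window start frames: 0, clip_len, 2 * clip_len, ...
    for start in range(0, video_len - clip_len + 1, clip_len):
        frames = range(start, start + clip_len)
        for person_id, track in sorted(tracks.items()):
            if all(f in track for f in frames):
                clips.append(
                    {
                        "person_id": person_id,
                        "start": start,
                        "poses": [track[f] for f in frames],
                    }
                )
    return clips


def make_track(first: int, last: int, num_joints: int = 17) -> Track:
    """Build a dummy track visible from frame `first` to `last` inclusive."""
    return {f: [(0.0, 0.0)] * num_joints for f in range(first, last + 1)}


# Person 0 is visible for frames 0-7, person 1 only for frames 2-7.
tracks = {0: make_track(0, 7), 1: make_track(2, 7)}
clips = split_into_short_clips(tracks, video_len=8, clip_len=4)
print([(c["person_id"], c["start"]) for c in clips])
```

In this sketch, an identity switch inside a long track would affect only the window in which it occurs, while the remaining short sequences stay clean. Under the two-stage paradigm, each short sequence would then be encoded on its own, and the relationships between sequences would be modeled afterwards.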