Abstract: A novel 2D human skeleton action recognition model with spatial constraints, named 2D‐SCHAR, is introduced to address the ambiguity and uncertainty of human action recognition in 2D surveillance videos. These issues stem from the absence of depth information in the action data, so we concentrate on two main challenges, depth estimation and spatial transformation, to enhance recognition accuracy. The depth estimation component reconstructs 3D action data from 2D inputs, while the spatial transformation component applies spatial constraints to adjust and rectify the 3D action data.

In video surveillance scenarios, human actions are predominantly presented in 2D, which hinders the accurate determination of action details that are not apparent in 2D data. Depth estimation can aid human action recognition with neural networks and improve its accuracy. However, image-based depth estimation requires extensive computational resources and cannot utilise the connectivity of the human body structure. Moreover, the estimated depth may not accurately reflect actual depth ranges, so its reliability needs to be improved. Therefore, a 2D human skeleton action recognition method with spatial constraints (2D‐SCHAR) is introduced. 2D‐SCHAR employs graph convolution networks to process graph‐structured human skeleton action data and comprises three parts: depth estimation, spatial transformation, and action recognition. The first two components, which infer 3D information from 2D human skeleton actions and generate spatial transformation parameters to correct abnormal deviations in the action data, support the third in improving recognition accuracy. The model is designed in an end‐to‐end, multi-task manner, allowing parameter sharing among the three components to boost performance. The experimental results validate the model's effectiveness and superiority in human skeleton action recognition.
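The three-part structure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 5-joint toy skeleton, the layer sizes, the single 3x3 residual transform used for the spatial transformation, and every name (`BONES`, `init_params`, `forward`, the `w_*` keys) are assumptions chosen for illustration. It only shows how a shared graph-convolution feature could feed a per-joint depth head, a transformation head, and an action classifier in one forward pass.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton given as (joint, joint) bones.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]
NUM_JOINTS = 5


def normalized_adjacency(num_joints, bones):
    """Symmetric normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_joints)
    for i, j in bones:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(x, a_hat, w):
    """One graph convolution with ReLU. x: (T, V, C_in), w: (C_in, C_out)."""
    return np.maximum(a_hat @ x @ w, 0.0)


def init_params(rng, hidden=8, num_classes=4):
    """Random weights for the sketch (assumed sizes, not from the paper)."""
    return {
        "w_shared": rng.normal(scale=0.1, size=(2, hidden)),     # shared backbone
        "w_depth": rng.normal(scale=0.1, size=(hidden, 1)),      # depth head
        "w_transform": rng.normal(scale=0.1, size=(hidden, 9)),  # transform head
        "w_action": rng.normal(scale=0.1, size=(3, hidden)),     # recognition GCN
        "w_cls": rng.normal(scale=0.1, size=(hidden, num_classes)),
    }


def forward(xy, params, a_hat):
    """xy: (T, V, 2) 2D joints. Returns depth, corrected 3D joints, class logits."""
    h = graph_conv(xy, a_hat, params["w_shared"])      # shared features (T, V, H)

    # Part 1, depth estimation: one depth value per joint per frame.
    depth = h @ params["w_depth"]                      # (T, V, 1)
    xyz = np.concatenate([xy, depth], axis=-1)         # (T, V, 3)

    # Part 2, spatial transformation: assumed here to be one 3x3 matrix per
    # sequence, predicted as a residual around the identity.
    pooled = h.mean(axis=(0, 1))                       # (H,)
    transform = np.eye(3) + (pooled @ params["w_transform"]).reshape(3, 3)
    xyz_corrected = xyz @ transform.T                  # (T, V, 3)

    # Part 3, action recognition on the corrected 3D skeleton.
    h3 = graph_conv(xyz_corrected, a_hat, params["w_action"])
    logits = h3.mean(axis=(0, 1)) @ params["w_cls"]    # (num_classes,)
    return depth, xyz_corrected, logits


a_hat = normalized_adjacency(NUM_JOINTS, BONES)
params = init_params(np.random.default_rng(0))
xy = np.random.default_rng(1).normal(size=(6, NUM_JOINTS, 2))  # 6 frames
depth, xyz_corrected, logits = forward(xy, params, a_hat)
print(depth.shape, xyz_corrected.shape, logits.shape)
```

In a multi-task training setup, losses on `depth` and `logits` would both backpropagate into `w_shared`, which is one simple way to realise the parameter sharing the abstract mentions; the sketch omits training entirely.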

SkeletonCLIP: Recognizing Skeleton-based Human Actions with Text Prompts

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Advancing Human Motion Recognition with SkeletonCLIP++: Weighted Video Feature Integration and Enhanced Contrastive Sample Discrimination

Semantic-guided multi-scale human skeleton action recognition

Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network

Skeleton-Based Human Action Recognition with Noisy Labels

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Action Recognition Scheme Based on Skeleton Representation with DS-LSTM Network

Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition

2D human skeleton action recognition with spatial constraints

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

Multi-Modal Transformer with Skeleton and Text for Action Recognition

Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning

Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN

Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Expressive Keypoints for Skeleton-based Action Recognition via Skeleton Transformation

Enhancing Skeleton-Based Action Recognition with Language Descriptions from Pre-trained Large Multimodal Models

Prompt-supervised dynamic attention graph convolutional network for skeleton-based action recognition

Skeleton edge motion networks for human action recognition

Multiple temporal scale aggregation graph convolutional network for skeleton-based action recognition

Multi-Scale Enhanced Active Learning for Skeleton-Based Action Recognition