Abstract: Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce PoseWatch, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. PoseWatch features an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that enhances the representation of human motion over time, which is also beneficial for broader human behavior analysis tasks. The architecture's core, a Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that PoseWatch consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD. This work not only demonstrates the efficacy of PoseWatch but also highlights the potential of integrating Natural Language Processing techniques with computer vision to advance human behavior analysis.

What problem does this paper attempt to address?

This paper addresses the automatic detection of anomalous events in video, with a focus on anomalous human behavior. Specifically, it asks how unusual human behaviors, such as falls, physical altercations, or abnormal crowding in public places, can be identified reliably in complex and changing environments. The task is hard because anomalous events are unpredictable, rare, and diverse, and because human-centric data raises bias and privacy concerns.

To address these challenges, the paper proposes PoseWatch, a pose-based video anomaly detection method. It uses human pose as a high-level feature in order to mitigate privacy concerns, reduce appearance bias, and minimize background interference. PoseWatch has two main components. The Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method improves the representation of human motion over time, which also benefits broader human behavior analysis tasks. The core of the architecture, a transformer named Unified Encoder Twin Decoders (UETD), improves the detection of anomalous behaviors in video data.

The main contributions of the paper are:
1. ST-PRP tokenization, a new method for tokenizing pose sequences, whose advantages are demonstrated through extensive ablation studies.
2. PoseWatch, a model for self-supervised human anomaly detection that combines ST-PRP tokenization with the novel non-autoregressive UETD transformer, which includes a Current Target Decoder (CTD) and a Future Target Decoder (FTD).
3. An evaluation of the accuracy and generalization ability of PoseWatch against state-of-the-art pose-based and pixel-based methods on multiple benchmark datasets.

Through these contributions, the paper improves the performance of video anomaly detection and highlights the potential of combining natural language processing techniques with computer vision to advance human behavior analysis.
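
To make the idea of combining pose and relative pose in a token more concrete, here is a minimal Python sketch. It is not the paper's ST-PRP implementation: the names `relative_pose` and `tokenize`, the definition of relative pose as each keypoint's offset from the per-frame centroid, and the token layout (one token per keypoint, holding its trajectory over time) are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Pose = List[Point]  # one frame: K keypoints as (x, y)


def relative_pose(pose: Pose) -> Pose:
    """Offset of each keypoint from the frame's centroid (assumed definition)."""
    cx = sum(x for x, _ in pose) / len(pose)
    cy = sum(y for _, y in pose) / len(pose)
    return [(x - cx, y - cy) for x, y in pose]


def tokenize(sequence: List[Pose]) -> List[List[float]]:
    """Build one token per keypoint from a sequence of T frames.

    Token k is [x_1, y_1, rx_1, ry_1, ..., x_T, y_T, rx_T, ry_T], where
    (x_t, y_t) is the absolute position of keypoint k in frame t and
    (rx_t, ry_t) is its hypothetical relative position in that frame.
    """
    num_keypoints = len(sequence[0])
    if any(len(pose) != num_keypoints for pose in sequence):
        raise ValueError("every frame must have the same number of keypoints")
    rel = [relative_pose(pose) for pose in sequence]
    tokens = []
    for k in range(num_keypoints):
        token: List[float] = []
        for t in range(len(sequence)):
            x, y = sequence[t][k]
            rx, ry = rel[t][k]
            token.extend([x, y, rx, ry])
        tokens.append(token)
    return tokens


# Toy input: two frames, three keypoints; the second frame is the first
# shifted by (1, 1), so only the absolute coordinates change.
frames = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)],
]
tokens = tokenize(frames)
```

In this toy input the person only translates between frames, so the relative part of each token is identical across time while the absolute part changes. Under these assumptions, that separation of global movement from body configuration is one plausible reason a relative-pose channel could help represent motion.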

PoseWatch: A Transformer-based Architecture for Human-centric Video Anomaly Detection Using Spatio-temporal Pose Tokenization

An Exploratory Study on Human-Centric Video Anomaly Detection through Variational Autoencoders and Trajectory Prediction

Understanding the Challenges and Opportunities of Pose-based Anomaly Detection

Enhancing Video Anomaly Detection Using a Transformer Spatiotemporal Attention Unsupervised Framework for Large Datasets

Evaluating the Effectiveness of Video Anomaly Detection in the Wild: Online Learning and Inference for Real-world Deployment

Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers

Pose-Motion Video Anomaly Detection via Memory-Augmented Reconstruction and Conditional Variational Prediction

Hierarchical Graph Embedded Pose Regularity Learning via Spatio-Temporal Transformer for Abnormal Behavior Detection

Human Pose Estimation from Ambiguous Pressure Recordings with Spatio-temporal Masked Transformers

Normalizing Flows for Human Pose Anomaly Detection

Human Kinematics-inspired Skeleton-based Video Anomaly Detection

Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach

Anomaly detection in surveillance videos using transformer based attention model

Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection

Configurable Spatial-Temporal Hierarchical Analysis for Flexible Video Anomaly Detection

Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection

Memory Enhanced Spatial-Temporal Graph Convolutional Autoencoder for Human-Related Video Anomaly Detection.

Anomaly detection in surveillance videos using Transformer with margin learning

Video Anomaly Detection Based on Spatio-Temporal Relationships among Objects

Open-Vocabulary Video Anomaly Detection

TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection