Abstract: Video Anomaly Detection (VAD) has been an active research field for several decades. However, most existing approaches extract only a single type of feature from videos and define a single paradigm to indicate the extent of abnormality. In this work, a coarse‐to‐fine three‐level prediction is built by integrating different levels of spatio‐temporal representations, which better highlights the difference between normal and abnormal behaviors. First, an object‐level trajectory prediction models the historical positions of humans using a graph transformer network. Subsequently, skeleton‐level prediction is achieved by incorporating the positional information from the trajectory prediction. More importantly, based on the predicted skeleton, a skeleton‐guided pixel‐level region prediction is performed. A novel Skeleton Conditioned Generative Adversarial Network (SCGAN) is designed to explore the correlation between skeleton‐level and pixel‐level motion prediction. Thanks to SCGAN, the prediction of human regions draws on both coarse‐grained and fine‐grained motion features. This three‐level prediction, namely Progressive Prediction Video Anomaly Detection (P3VAD), enlarges the prediction error on irregular motion patterns. This is the first effort to progressively combine three‐level predictions from coarse to fine‐grained for VAD. In addition, a pixel‐level analysis method is proposed to achieve Background‐bias Elimination (BE) and denoise the predicted region. Experimental results on four publicly available large‐scale benchmark datasets (ShanghaiTech, CUHK Avenue, IITB‐Corridor, and ADOC), in both micro‐AUC and macro‐AUC metrics, validate the effectiveness of P3VAD.
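
The abstract reports results in both micro‐AUC and macro‐AUC. As a rough illustration of how the two differ, the sketch below assumes the convention common in the VAD literature: micro‐AUC pools the frames of all test videos before computing a single AUC, while macro‐AUC computes one AUC per video and averages them. The function names, the data layout (a dict mapping a video name to per-frame labels and anomaly scores), and the pairwise AUC helper are hypothetical choices made for this example and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: video name -> (frame labels, frame anomaly scores),
# where label 1 marks an abnormal frame and 0 a normal one.
VideoScores = Dict[str, Tuple[List[int], List[float]]]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (abnormal, normal) frame pairs in which
    the abnormal frame scores higher; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both normal and abnormal frames")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def micro_auc(videos: VideoScores) -> float:
    """Pool the frames of every video, then compute a single AUC."""
    labels = [y for ys, _ in videos.values() for y in ys]
    scores = [s for _, ss in videos.values() for s in ss]
    return roc_auc(labels, scores)


def macro_auc(videos: VideoScores) -> float:
    """Compute one AUC per video, then take the unweighted mean."""
    per_video = [roc_auc(ys, ss) for ys, ss in videos.values()]
    return sum(per_video) / len(per_video)


if __name__ == "__main__":
    toy = {
        "video_a": ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),
        "video_b": ([0, 1, 0, 1], [0.5, 0.6, 0.7, 0.95]),
    }
    print("micro-AUC:", micro_auc(toy))
    print("macro-AUC:", macro_auc(toy))
```

Note that this sketch raises an error for a video containing only normal or only abnormal frames; evaluation protocols handle that case in different ways, and the choice is left open here.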

Prediction-CGAN

Early Action Prediction with Generative Adversarial Networks

Pose Guided Global and Local GAN for Appearance Preserving Human Video Prediction

Deep Video Generation, Prediction and Completion of Human Action Sequences

Human Action Generation with Generative Adversarial Networks

Adaptive Graph Convolutional Network with Adversarial Learning for Skeleton-Based Action Prediction

Efficient Human Motion Prediction Using Temporal Convolutional Generative Adversarial Network

Adversarial Memory Networks for Action Prediction

Conditional Temporal Variational AutoEncoder for Action Video Prediction

Action Knowledge Transfer for Action Prediction with Partial Videos

Aggregated Multi-GANs for Controlled 3D Human Motion Prediction

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Recurrent Semantic Preserving Generation for Action Prediction

Pose-guided Generative Adversarial Net for Novel View Action Synthesis

3D Human motion anticipation and classification

Edge Guided Generation Network for Video Prediction

Action-conditioned video data improves predictability

Predictive Learning: Using Future Representation Learning Variantial Autoencoder for Human Action Prediction

Egocentric Early Action Prediction via Adversarial Knowledge Distillation

Action Selection Based on Prediction for Robot Planning

Progressive prediction: Video anomaly detection via multi‐grained prediction