Abstract: Video Anomaly Detection (VAD) has been an active research field for several decades. However, most existing approaches merely extract a single type of feature from videos and define a single paradigm to indicate the extent of abnormalities. A coarse‐to‐fine three‐level prediction is built by integrating different levels of spatio‐temporal representations, better highlighting the difference between normal and abnormal behaviors. First, an object‐level trajectory prediction is proposed to model historical human positions using a graph transformer network. Subsequently, skeleton‐level prediction is achieved by incorporating the positional information from the trajectory prediction. More importantly, based on the predicted skeleton, a skeleton‐guided pixel‐level region prediction is performed. A novel Skeleton Conditioned Generative Adversarial Network (SCGAN) is designed to explore the correlation between skeleton‐level and pixel‐level motion prediction. Benefiting from SCGAN, the prediction of human regions draws on both coarse‐grained and fine‐grained motion features. This three‐level prediction, namely Progressive Prediction Video Anomaly Detection (P3VAD), enlarges the prediction error on irregular motion patterns. This is the first effort to progressively combine three‐level predictions from coarse to fine‐grained for VAD. Besides, a pixel‐level analysis method is proposed to achieve Background‐bias Elimination (BE) and denoise the predicted region. An extensive experimental evaluation validates the effectiveness of P3VAD on four public large‐scale benchmark datasets (ShanghaiTech, CUHK Avenue, IITB‐Corridor, and ADOC) in both micro‐AUC and macro‐AUC metrics.
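
The abstract reports results in both micro‐AUC and macro‐AUC without defining them. As commonly used in the VAD literature, micro‐AUC pools the frames of all test videos before computing a single frame‐level AUC, while macro‐AUC computes one AUC per video and averages them. The sketch below illustrates that distinction only; it is not the authors' evaluation code. The function names, the toy scores, and the choice to skip videos containing a single class are all assumptions made for illustration.

```python
# Minimal sketch of micro-AUC vs. macro-AUC for frame-level anomaly scores.
# All names and data here are hypothetical; the paper's exact protocol is not
# given in the abstract.

def auc(scores, labels):
    """ROC AUC as the probability that an abnormal frame outscores a normal
    one (ties count as half). Returns None if only one class is present."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def micro_auc(videos):
    """Concatenate the frames of all videos, then compute one AUC."""
    scores = [s for v_scores, _ in videos for s in v_scores]
    labels = [y for _, v_labels in videos for y in v_labels]
    return auc(scores, labels)

def macro_auc(videos):
    """Average the per-video AUCs (assumption: single-class videos are skipped)."""
    per_video = [auc(s, y) for s, y in videos]
    valid = [a for a in per_video if a is not None]
    return sum(valid) / len(valid)

# Each entry is (per-frame anomaly scores, per-frame labels; 1 = abnormal).
videos = [
    ([0.10, 0.20, 0.92, 0.80], [0, 0, 1, 1]),
    ([0.85, 0.90, 0.99, 0.88], [0, 0, 1, 0]),
]

micro = micro_auc(videos)
macro = macro_auc(videos)
print(f"micro-AUC = {micro:.3f}, macro-AUC = {macro:.3f}")
```

In this toy data each video is ranked perfectly on its own, so the macro‐AUC is 1.0, but the second video has a higher overall score level, so pooling the frames lowers the micro‐AUC to 0.8. That sensitivity to per‐video score offsets is why the two metrics are often reported together.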

Structure Preserving Video Prediction

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Structure-Preserving Motion Estimation for Learned Video Compression

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

Progressive Multi-granularity Analysis for Video Prediction

Disentangling Propagation and Generation for Video Prediction

Predicting Diverse Future Frames with Local Transformation-Guided Masking

Video Frame Prediction with Dual-Stream Deep Network Emphasizing Motions and Content Details

A lightweight multi-granularity asymmetric motion mode video frame prediction algorithm

Exploring and Exploiting High-Order Spatial-Temporal Dynamics for Long-Term Frame Prediction

Video Prediction Via Selective Sampling

STIP: A SpatioTemporal Information-Preserving and Perception-Augmented Model for High-Resolution Video Prediction

State-space Decomposition Model for Video Prediction Considering Long-term Motion Trend

Video Prediction via Example Guidance

Progressive prediction: Video anomaly detection via multi‐grained prediction

Priority Belief Propagation-Based Inpainting Prediction with Tensor Voting Projected Structure in Video Compression

MMVP: Motion-Matrix-based Video Prediction

Motion-Aware Feature Enhancement Network for Video Prediction

Video Frame Prediction by Deep Multi-Branch Mask Network

Long-Term Video Prediction Via Criticization and Retrospection

Optimizing Video Prediction via Video Frame Interpolation