Abstract: Video Anomaly Detection (VAD) has been an active research field for several decades. However, most existing approaches extract only a single type of feature from videos and define a single paradigm to indicate the extent of abnormality. Here, a coarse‐to‐fine three‐level prediction is built by integrating different levels of spatio‐temporal representations, better highlighting the difference between normal and abnormal behaviors. First, an object‐level trajectory prediction is proposed to model historical human positions using a graph transformer network. Subsequently, skeleton‐level prediction is achieved by incorporating the positional information from the trajectory prediction. More importantly, based on the predicted skeleton, a skeleton‐guided pixel‐level region prediction is performed. A novel Skeleton Conditioned Generative Adversarial Network (SCGAN) is designed to explore the correlation between skeleton‐level and pixel‐level motion prediction. With SCGAN, the prediction of human regions draws on both coarse‐grained and fine‐grained motion features. This three‐level prediction, named Progressive Prediction Video Anomaly Detection (P3VAD), enlarges the prediction error on irregular motion patterns. In addition, a pixel‐level analysis method is proposed to achieve Background‐bias Elimination (BE) and to denoise the predicted region. This is the first effort to progressively combine three‐level predictions from coarse to fine‐grained for VAD. An extensive experimental evaluation on four publicly available large‐scale benchmark datasets (ShanghaiTech, CUHK Avenue, IITB‐Corridor, and ADOC), using both micro‐AUC and macro‐AUC metrics, validates the effectiveness of P3VAD.
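
The abstract reports both micro‐AUC and macro‐AUC without defining them. The sketch below assumes the convention commonly used in the VAD literature (not stated in this abstract): micro‐AUC pools the frames of all test videos before computing a single frame‐level ROC AUC, while macro‐AUC computes the AUC per video and averages the results. The function names, the `videos` layout, and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: video name -> (frame labels, frame anomaly scores),
# where label 1 marks an abnormal frame and a higher score means "more anomalous".
Videos = Dict[str, Tuple[List[int], List[float]]]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (abnormal, normal) frame pairs ranked correctly.

    Ties count as half a correct pair. This O(P*N) form is meant for clarity,
    not speed.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one normal and one abnormal frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def micro_auc(videos: Videos) -> float:
    """Pool the frames of every video, then compute one AUC."""
    all_labels = [y for labels, _ in videos.values() for y in labels]
    all_scores = [s for _, scores in videos.values() for s in scores]
    return roc_auc(all_labels, all_scores)


def macro_auc(videos: Videos) -> float:
    """Compute the AUC per video, then average over videos."""
    per_video = [roc_auc(labels, scores) for labels, scores in videos.values()]
    return sum(per_video) / len(per_video)


# Toy scores: each video ranks its own frames perfectly, but video_b's
# scores sit on a higher overall scale than video_a's.
example_videos: Videos = {
    "video_a": ([0, 0, 1, 1], [0.10, 0.20, 0.80, 0.90]),
    "video_b": ([0, 0, 1, 1], [0.85, 0.95, 0.97, 0.99]),
}

print(macro_auc(example_videos))  # 1.0
print(micro_auc(example_videos))  # 0.8125
```

In this toy case macro‐AUC is perfect because each video is ranked correctly in isolation, whereas micro‐AUC drops because normal frames of `video_b` outscore abnormal frames of `video_a`. Reporting both metrics therefore separates per‐video ranking quality from cross‐video score calibration.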

Long-Term Video Prediction Via Criticization and Retrospection

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

Long-Term Prediction of Natural Video Sequences with Robust Video Predictors

Adaptive Hierarchical Motion-Focused Model for Video Prediction

STIP: A SpatioTemporal Information-Preserving and Perception-Augmented Model for High-Resolution Video Prediction

State-space Decomposition Model for Video Prediction Considering Long-term Motion Trend

Predicting Diverse Future Frames with Local Transformation-Guided Masking

Flexible Spatio-Temporal Networks for Video Prediction

Predicting Long-horizon Futures by Conditioning on Geometry and Time

Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning

From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting

Progressive prediction: Video anomaly detection via multi‐grained prediction

PastNet: Introducing Physical Inductive Biases for Spatio-temporal Video Prediction

Comprehensive Regularization in a Bi-directional Predictive Network for Video Anomaly Detection

Continual Predictive Learning from Videos

Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context

Future Frame Prediction for Anomaly Detection -- A New Baseline

Optimizing Video Prediction via Video Frame Interpolation

Future Video Prediction from a Single Frame for Video Anomaly Detection

Looking-Ahead: Neural Future Video Frame Prediction