Abstract: Weakly supervised video anomaly detection (WS-VAD) is often formulated as a multiple instance learning (MIL) problem, in which snippet-level anomaly scores are predicted using only video-level annotations. Most MIL approaches, however, focus on improving the feature learning network and neglect the design of the preprocessing stage. MIL-based methods usually divide videos of different lengths into a predefined number of snippets for later anomaly identification. This is impractical for real-world videos of varying lengths, where the duration of anomalous events is unknown during training. The division produces data with different temporal resolutions, which confuses the network and limits its detection capability. To address this issue, we propose a novel WS-VAD method. First, a temporal resolution feature mapping module (TRFM) improves the network’s ability to learn from inputs with different temporal resolutions by mapping the temporal resolution information into the feature learning space. Second, we introduce a gated recurrent unit (GRU)-based multi-scale temporal feature learning module (MS-GRU), which combines GRUs with multi-scale convolutional structures and recursively fuses features at different time scales. This module exploits the ability of GRUs to extract temporal information while compensating for the fact that GRUs capture temporal dependence at only a single scale. In addition, we propose the Adaptive-k module, which optimizes the original Top-k loss and increases flexibility in training by generating the optimal number of anomalous segments k for each input. The approach is fully applicable to real-world videos of various lengths. Experimental results show that our model improves detection accuracy on data with large differences in temporal resolution and obtains state-of-the-art frame-level AUC on three real-world surveillance datasets: UCF-Crime, ShanghaiTech, and XD-Violence.
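The Top-k MIL objective that the Adaptive-k module builds on can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how k is derived from the input, so `placeholder_adaptive_k` and its `fraction` parameter are hypothetical stand-ins that simply scale k with the number of snippets. The function names are likewise invented for this sketch.

```python
import math


def topk_video_score(snippet_scores, k):
    """Mean of the k highest snippet-level anomaly scores (Top-k MIL pooling)."""
    if not 1 <= k <= len(snippet_scores):
        raise ValueError("k must be between 1 and the number of snippets")
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k


def placeholder_adaptive_k(num_snippets, fraction=0.125):
    """Hypothetical rule: k grows with the snippet count.

    This is NOT the paper's Adaptive-k module, only a stand-in that
    shows where an input-dependent k would enter the loss.
    """
    return max(1, math.ceil(num_snippets * fraction))


def video_bce_loss(snippet_scores, video_label, k):
    """Binary cross-entropy between the pooled score and the video-level label."""
    eps = 1e-7  # keep the log arguments away from 0
    p = min(max(topk_video_score(snippet_scores, k), eps), 1 - eps)
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1 - p))


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1]  # made-up snippet scores
    k = placeholder_adaptive_k(len(scores))
    print("k =", k, "loss =", video_bce_loss(scores, video_label=1, k=k))
```

With a fixed k, a short anomalous event in a long video is averaged together with normal snippets, while a long event in a short video is only partly covered; an input-dependent k is one way to relax that constraint.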

Effective Video Abnormal Event Detection by Learning A Consistency-Aware High-Level Feature Extractor

Video Abnormal Event Detection by Learning to Complete Visual Cloze Tests

Normality learning reinforcement for anomaly detection in surveillance videos

Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events

Video Anomaly Detection via Visual Cloze Tests

Collaborative Normality Learning Framework for Weakly Supervised Video Anomaly Detection

Sensing Anomalies Like Humans: A Hominine Framework to Detect Abnormal Events from Unlabeled Videos

Enhanced Memory Adversarial Network for Anomaly Detection

Memory-Augmented Spatial-Temporal Consistency Network for Video Anomaly Detection

Sparse Coding Guided Spatiotemporal Feature Learning for Abnormal Event Detection in Large Videos

Video Anomaly Detection Based on Global–Local Convolutional Autoencoder

FE-VAD: High-Low Frequency Enhanced Weakly Supervised Video Anomaly Detection

Exploiting Spatial-temporal Correlations for Video Anomaly Detection

Rethinking Prediction-Based Video Anomaly Detection from Local-Global Normality Perspective

Semantic-driven Dual Consistency Learning for Weakly Supervised Video Anomaly Detection

Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Abnormal Event Detection Via Feature Expectation Subgraph Calibrating Classification in Video Surveillance Scenes

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Spatiotemporal consistency-enhanced network for video anomaly detection

Video Anomaly Detection Based on Cross-Frame Prediction Mechanism and Spatio-Temporal Memory-Enhanced Pseudo-3D Encoder

Weakly-supervised Video Anomaly Detection Via Temporal Resolution Feature Learning