Abstract: Weakly supervised video anomaly detection (WS-VAD) is often formulated as a multiple instance learning (MIL) problem, in which snippet-level anomaly scores are predicted using only video-level annotations. Most MIL approaches focus on improving the feature learning network and overlook the design of the preprocessing stage. MIL-based methods usually divide videos of different lengths into a predefined number of snippets for later anomaly identification. This is impractical for real-world videos of varying lengths, where the duration of anomalous events is unknown during training. The division produces data with different temporal resolutions, which confuses the network and limits its detection capability. To address this issue, we propose a novel WS-VAD method. First, a temporal resolution feature mapping module (TRFM) improves the network’s ability to learn from inputs with different temporal resolutions by mapping the temporal resolution information into the feature learning space. Second, we introduce a gated recurrent unit (GRU)-based multi-scale temporal feature learning module (MS-GRU), which combines GRUs with multi-scale convolutional structures and recursively fuses features at different time scales. This module exploits the ability of GRUs to extract temporal information while compensating for the fact that GRUs alone capture temporal dependence at only a single scale. In addition, we propose an Adaptive-k module that optimizes the original Top-k loss and increases training flexibility by generating the number of anomalous segments k according to each input. The approach is therefore applicable to real-world videos of various lengths. Experimental results show that our model improves detection accuracy on data with large differences in temporal resolution and obtains state-of-the-art frame-level AUC on three real-world surveillance datasets: UCF-Crime, ShanghaiTech, and XD-Violence.
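To make the preprocessing issue and the Top-k pooling concrete, the following is a minimal illustrative sketch and not the method's actual implementation. It shows how a fixed snippet count yields very different temporal resolutions for short and long videos, and how a video-level score can be formed from the k highest snippet scores. The snippet count of 32, the function names, and the threshold-based rule for choosing k are all assumptions introduced here for illustration; the abstract does not specify how the Adaptive-k module derives k.

```python
def temporal_resolution(num_frames, num_snippets=32):
    """Average number of frames per snippet when a video is divided
    into a fixed number of snippets (32 is an assumed default)."""
    if num_frames <= 0 or num_snippets <= 0:
        raise ValueError("num_frames and num_snippets must be positive")
    return num_frames / num_snippets


def topk_video_score(snippet_scores, k):
    """Video-level anomaly score as the mean of the k highest
    snippet-level scores (Top-k pooling in a MIL setting)."""
    if not 1 <= k <= len(snippet_scores):
        raise ValueError("k must be between 1 and the number of snippets")
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k


def assumed_adaptive_k(snippet_scores, threshold=0.5):
    """Hypothetical stand-in for an input-dependent k: count the snippets
    whose score exceeds a threshold, using at least one snippet.
    This rule is an assumption, not the paper's Adaptive-k module."""
    return max(1, sum(1 for s in snippet_scores if s > threshold))


if __name__ == "__main__":
    # A short and a long video, both divided into the same number of snippets.
    print(temporal_resolution(320))     # frames per snippet, short video
    print(temporal_resolution(32000))   # frames per snippet, long video

    scores = [0.1, 0.9, 0.8, 0.2]
    k = assumed_adaptive_k(scores)
    print(k, topk_video_score(scores, k))
```

With a fixed k, a long video whose anomaly spans many snippets and a short video whose anomaly spans a single snippet are pooled identically; letting k depend on the input, as the Adaptive-k module is described as doing, avoids that mismatch.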

Enhancing Feature Representation for Anomaly Detection Via Local-and-Global Temporal Relations and a Multi-stage Memory

Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning

Weakly-supervised Video Anomaly Detection Via Temporal Resolution Feature Learning

Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning

Learning Causal Temporal Relation and Feature Discrimination for Anomaly Detection

Weakly supervised anomaly detection with multi-level contextual modeling

A novel spatio-temporal memory network for video anomaly detection

Video Anomaly Detection with Multi-Scale Feature and Temporal Information Fusion

TFAE: temporal feature adjustable enhancement for video anomaly detection

Multi-Scale Temporal Relations and Segmented Channel Attention for Video Anomaly Detection

Enhanced Memory Adversarial Network for Anomaly Detection

A Two-Branch Network for Video Anomaly Detection with Spatio-Temporal Feature Learning

Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Dual Memory Units with Uncertainty Regulation for Weakly Supervised Video Anomaly Detection

Multimodal and multiscale feature fusion for weakly supervised video anomaly detection

MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos

Memory-Augmented Spatial-Temporal Consistency Network for Video Anomaly Detection

Learning Task-Specific Representation for Video Anomaly Detection with Spatial-Temporal Attention

Weakly Supervised Video Anomaly Detection Via Self-Guided Temporal Discriminative Transformer

Exploiting Spatial-temporal Correlations for Video Anomaly Detection

Weakly Supervised Video Anomaly Detection Via Transformer-Enabled Temporal Relation Learning