Abstract: Weakly-supervised video anomaly detection is typically formulated as a multi-instance learning problem, in which an anomaly score is assigned to each video snippet by learning to rank with only video-level labels. However, previous approaches that rely on snippet-level embeddings generated by task-agnostic feature extractors inevitably encounter challenges such as intra-bag similarities and frame-level entanglement. As a result, the model may exhibit significant performance degradation, particularly when the target is looming or receding. To address these issues, we present a novel weakly-supervised hierarchical position-scale awareness model that incorporates heterogeneous cross-scale correlation learning to improve detection performance. Specifically, in addition to snippet-level embedding, we employ an object detector (e.g., YOLOv5) for frame-level target detection and perform normal target clustering. By introducing a hierarchical ranking strategy, we gradually disentangle potential anomaly targets from the frame level to the snippet level. Subsequently, we design a simple yet efficient position-scale awareness inference method that predicts the spatial positions and scales of looming and receding targets from the high-confidence abnormal targets in adjacent snippets. Furthermore, we introduce heterogeneous cross-scale correlation learning to capture the correlation between targets and snippet embeddings, enabling our model to pay more attention to anomaly-related targets. Compared to previous approaches that generate only an anomaly score for each snippet, our method can also locate anomalous targets, making it more suitable for practical applications. Without bells and whistles, evaluations on five commonly used VAD benchmarks (ShanghaiTech, UCSD-Ped2, Avenue, UCF-Crime, and UBnormal) show that our method yields competitive and highly promising results compared with existing unsupervised, self-supervised, and weakly-supervised competitors.
The code will be made publicly available.
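
For readers unfamiliar with the multi-instance ranking formulation mentioned in the abstract, the following is a minimal illustrative sketch, not the authors' code. It assumes one widely used form of the objective: each video is a bag of per-snippet anomaly scores, and a hinge loss pushes the top-scoring snippet of an anomalous video above the top-scoring snippet of a normal video. The function name `mil_ranking_loss`, the `margin` parameter, and the use of the bag maximum are assumptions for illustration; the hierarchical ranking strategy described above may differ.

```python
from typing import Sequence


def mil_ranking_loss(
    anomalous_bag: Sequence[float],
    normal_bag: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Hinge ranking loss between the top-scoring snippets of two bags.

    Each bag holds per-snippet anomaly scores in [0, 1] for one video.
    Only the video-level label (anomalous vs. normal) is assumed known.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if not anomalous_bag or not normal_bag:
        raise ValueError("both bags must contain at least one snippet score")
    # The highest-scoring snippet stands in for the whole bag.
    top_anomalous = max(anomalous_bag)
    top_normal = max(normal_bag)
    # Zero loss once the anomalous bag outranks the normal bag by the margin.
    return max(0.0, margin - top_anomalous + top_normal)
```

Under these assumptions, a well-separated pair such as `mil_ranking_loss([1.0], [0.0])` incurs no loss, while a pair whose top scores are close is penalized in proportion to the shortfall from the margin.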

Pose-Motion Video Anomaly Detection via Memory-Augmented Reconstruction and Conditional Variational Prediction

Learning Appearance-Motion Synergy Via Memory-Guided Event Prediction for Video Anomaly Detection

Learning Appearance-motion Normality for Video Anomaly Detection

Memory Enhanced Spatial-Temporal Graph Convolutional Autoencoder for Human-Related Video Anomaly Detection

Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation: A Unified Approach

Video anomaly detection based on a multi-layer reconstruction autoencoder with a variance attention strategy

Video Anomaly Detection Via Successive Image Frame Prediction Leveraging Optical Flows

A Novel Unsupervised Video Anomaly Detection Framework Based on Optical Flow Reconstruction and Erased Frame Prediction

Memory-Augmented Spatial-Temporal Consistency Network for Video Anomaly Detection

Video Anomaly Detection Based on Spatio-Temporal Relationships among Objects

Research on Video Anomaly Detection Based on Cascaded Memory-augmented Autoencoder

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Memory-enhanced appearance-motion consistency framework for video anomaly detection

Video Anomaly Detection By The Duality Of Normality-Granted Optical Flow

Generate anomalies from normal: a partial pseudo-anomaly augmented approach for video anomaly detection

Spatiotemporal Masked Autoencoder with Multi-Memory and Skip Connections for Video Anomaly Detection

Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection

Anomalies cannot materialize or vanish out of thin air: A hierarchical multiple instance learning with position-scale awareness for video anomaly detection

Cognition Guided Video Anomaly Detection Framework for Surveillance Services

Video Anomaly Detection Based on Global–Local Convolutional Autoencoder

Rethinking Prediction-Based Video Anomaly Detection from Local-Global Normality Perspective