Abstract: Weakly-supervised video anomaly detection is typically formulated as a multi-instance learning problem, in which anomaly scores are assigned to each video snippet by learning to rank with only video-level labels. However, previous approaches that rely on snippet-level embeddings generated by task-agnostic feature extractors inevitably encounter challenges such as intra-bag similarities and frame-level entanglement. As a result, the model may exhibit significant performance degradation, particularly when the target is looming or receding. To address these issues, we present a novel weakly-supervised hierarchical position-scale awareness model that incorporates heterogeneous cross-scale correlation learning to improve detection performance. Specifically, in addition to snippet-level embedding, we employ an object detector (e.g., YOLOv5) for frame-level target detection and perform normal target clustering. By introducing a hierarchical ranking strategy, we gradually disentangle potential anomaly targets from the frame level to the snippet level. Subsequently, we design a simple yet efficient position-scale awareness inference method that predicts the spatial positions and scales of looming and receding targets from the high-confidence abnormal targets in adjacent snippets. Furthermore, we introduce heterogeneous cross-scale correlation learning to capture the correlation between targets and snippet embeddings, enabling our model to pay more attention to anomaly-related targets. Compared to previous approaches that generate only an anomaly score for each snippet, our method can also locate anomalous targets, making it more suitable for practical applications. Without bells and whistles, evaluations on five commonly used VAD benchmarks (ShanghaiTech, UCSD-Ped2, Avenue, UCF-Crime, and UBnormal) show that our method yields competitive and highly promising results compared with existing unsupervised, self-supervised, and weakly-supervised competitors.
The code will be made publicly available.
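
As a point of reference for the multi-instance learning formulation mentioned in the abstract, the following is a minimal sketch of a generic bag-level ranking hinge loss. It is not the hierarchical ranking strategy proposed in this paper; the function name `mil_ranking_loss`, the `margin` parameter, and the example scores are all illustrative assumptions. Each video is treated as a bag of snippet scores, and only the video-level label (abnormal or normal) decides which bag is which.

```python
from typing import Sequence


def mil_ranking_loss(
    abnormal_scores: Sequence[float],
    normal_scores: Sequence[float],
    margin: float = 1.0,  # assumed hyperparameter, not taken from the paper
) -> float:
    """Generic MIL ranking hinge loss over two bags of snippet scores.

    The highest-scoring snippet of the abnormal video is pushed to exceed
    the highest-scoring snippet of the normal video by at least `margin`.
    """
    if len(abnormal_scores) == 0 or len(normal_scores) == 0:
        raise ValueError("both bags must contain at least one snippet score")

    # Only video-level labels are available, so each bag is summarized
    # by its most anomalous-looking snippet.
    top_abnormal = max(abnormal_scores)
    top_normal = max(normal_scores)

    # Hinge: zero loss once the abnormal bag outranks the normal bag by `margin`.
    return max(0.0, margin - top_abnormal + top_normal)


# Hypothetical snippet scores for one abnormal and one normal video.
loss = mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.3])
```

Summarizing each bag by its maximum is the simplest choice here; it is exactly this reliance on snippet-level scores alone that the abstract identifies as a limitation when snippets within a bag look similar.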

Discriminatively Trained Latent Ordinal Model for Video Classification

Learning Structured Ordinal Measures for Video based Face Recognition

Self-trained Deep Ordinal Regression for End-to-End Video Anomaly Detection

Joint Structured Sparsity Regularized Multiview Dimension Reduction for Video-Based Facial Expression Recognition

Latent Fisher Discriminant Analysis

Discriminative Video Representation with Temporal Order for Micro-expression Recognition

Robust Attribute-Based Visual Recognition Using Discriminative Latent Representation

Anomalies cannot materialize or vanish out of thin air: A hierarchical multiple instance learning with position-scale awareness for video anomaly detection

Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes

Latent Bi-Constraint SVM for Video-Based Object Recognition

Learning Implicit Temporal Alignment for Few-shot Video Classification

Self-supervised Temporal Discriminative Learning for Video Representation Learning

Implicit Temporal Modeling with Learnable Alignment for Video Recognition

Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Active Learning for Video Classification with Frame Level Queries

Deep Domain Adaptation for Ordinal Regression of Pain Intensity Estimation Using Weakly-Labelled Videos

Inter-intra Variant Dual Representations for Self-supervised Video Recognition

HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

Discriminative Regression With Latent Label Learning for Image Classification

Deep Multimodal Learning: An Effective Method for Video Classification