Abstract: Weakly-supervised video anomaly detection is typically formulated as a multi-instance learning problem, in which anomaly scores are assigned to each video snippet by learning to rank with only video-level labels. However, previous approaches that rely on snippet-level embeddings generated by task-agnostic feature extractors inevitably encounter challenges such as intra-bag similarities and frame-level entanglement. As a result, such models may exhibit significant performance degradation, particularly when the target is looming or receding. To address these issues, we present a novel weakly-supervised hierarchical position-scale awareness model that incorporates heterogeneous cross-scale correlation learning to improve detection performance. Specifically, in addition to snippet-level embedding, we employ an object detector (e.g., YOLOv5) for frame-level target detection and perform normal target clustering. By introducing a hierarchical ranking strategy, we gradually disentangle potential anomaly targets from the frame level to the snippet level. Subsequently, we design a simple yet efficient position-scale awareness inference method that predicts the spatial positions and scales of looming and receding targets based on the high-confidence abnormal targets in adjacent snippets. Furthermore, we introduce heterogeneous cross-scale correlation learning to capture the correlation between targets and snippet embeddings, enabling our model to pay more attention to anomaly-related targets. Compared to previous approaches that generate only anomaly scores for each snippet, our method can locate anomalous targets, making it more suitable for practical applications. Without bells and whistles, evaluations on the commonly used VAD benchmarks ShanghaiTech, UCSD-Ped2, Avenue, UCF-Crime, and UBnormal show that our method yields competitive results compared with existing unsupervised, self-supervised, and weakly-supervised competitors.
The code will be made publicly available.
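
For readers unfamiliar with the multi-instance learning formulation mentioned in the abstract, the following is a minimal sketch of a generic MIL ranking hinge loss over snippet scores. It is not the hierarchical ranking strategy proposed in the paper; the function name `mil_ranking_loss`, the `margin` parameter, and the use of plain Python lists are assumptions made purely for illustration. Each video is treated as a bag of snippet scores, and the top-scoring snippet of an abnormal video is encouraged to outrank the top-scoring snippet of a normal video.

```python
from typing import Sequence


def mil_ranking_loss(abnormal_scores: Sequence[float],
                     normal_scores: Sequence[float],
                     margin: float = 1.0) -> float:
    """Hinge ranking loss between the top-scoring snippets of two bags.

    abnormal_scores: snippet scores from a video labeled abnormal (hypothetical input).
    normal_scores:   snippet scores from a video labeled normal (hypothetical input).
    margin:          assumed hinge margin; 1.0 is an illustrative default.
    """
    if not abnormal_scores or not normal_scores:
        raise ValueError("each bag needs at least one snippet score")
    # Only video-level labels are known, so each bag is represented
    # by its highest-scoring snippet.
    top_abnormal = max(abnormal_scores)
    top_normal = max(normal_scores)
    # The loss is zero once the abnormal bag outranks the normal bag by `margin`.
    return max(0.0, margin - top_abnormal + top_normal)
```

In this sketch the loss vanishes only when the most suspicious snippet of the abnormal video scores at least `margin` higher than every snippet of the normal video, which is how ranking can be learned without snippet-level labels.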

Anomalies cannot materialize or vanish out of thin air: A hierarchical multiple instance learning with position-scale awareness for video anomaly detection
