Abstract: Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. However, the ambiguous nature of anomaly definitions across contexts may introduce inaccuracy in discriminating abnormal and normal events. To show the model what is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them could be calculated to identify the suspected events for each video snippet. It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos, as well as provides a new way to label pseudo anomalies for self-training. To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (86.5%, 90.4%, 94.4%, and 97.4%). Furthermore, it shows promising performance in open-set and cross-dataset cases. The data, code, and models can be found at: https://github.com/shiwoaz/lap.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve several key problems in weakly-supervised video anomaly detection (WS-VAD):

1. **Ambiguous anomaly definition**: Existing WS-VAD methods usually rely on multiple instance learning (MIL). When distinguishing between normal and abnormal segments, these methods often do not specify the type of anomaly. This ambiguity may lead to a high false-positive rate and low accuracy.
2. **Lack of semantic information**: Traditional MIL methods mainly focus on the visual modality and ignore the rich semantic information contained in text descriptions. This makes it difficult for the model to understand complex abnormal events, especially how they vary across scenarios.
3. **Insufficient self-supervised labels**: Existing methods lack an effective pseudo-label generation mechanism and cannot fully utilize unlabeled data for self-supervised training, which limits model performance.

To solve these problems, the authors propose a new framework, **Learning Suspected Anomalies from Event Prompts (LAP)**. This framework combines text descriptions with video content by introducing an event-prompt dictionary that guides the model to identify abnormal events more accurately. Specifically, the main contributions of the LAP framework include:

- **Introducing text prompts**: Text prompts that describe abnormal events help the model better understand the specific manifestations of anomalies, thereby improving performance on open-set and cross-dataset problems.
- **Multi-prompt learning strategy**: A new multi-prompt learning strategy enables the model to learn normal and abnormal patterns across multiple videos, rather than being limited to a single video.
- **Pseudo-abnormal label generation**: Based on the semantic similarity between event prompts and videos, additional pseudo-abnormal labels are mined for self-supervised training, further enhancing the detection ability of the model.

### Formula summary

1. **Feature synthesis**:
   \[ F_a=\theta(V_a, T_a),\quad F_n = \theta(V_n, T_n) \]
   where $\theta$ represents the feature alignment and fusion operation, which can be concatenation or addition.
2. **Multi-prompt learning loss**:
   \[ L_{\text{MPL}}=\max\left(\|f_{\text{anc}} - f_{\text{pos}}\|^2-\|f_{\text{anc}} - f_{\text{neg}}\|^2+\alpha, 0\right) \]
   where $\alpha$ is the margin coefficient.
3. **Pseudo-abnormal label loss**:
   \[ L_{\text{PAL}}=\sum_{i = 1}^N-\left(p[i]\log(s_a[i])+(1 - p[i])\log(1 - s_a[i])\right) \]
4. **Total loss function**:
   \[ L_{\text{LAP}}=L_{\text{MIL}}+\beta L_{\text{MPL}}+\gamma L_{\text{PAL}} \]
   where $\beta$ and $\gamma$ are hyperparameters.

With these components, the LAP framework outperforms most existing methods on multiple benchmark datasets in terms of average precision (AP) or area under the curve (AUC).
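
The loss terms in the formula summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the default values of `alpha`, `beta`, and `gamma`, and the clipping constant `eps` are hypothetical, and the sketch assumes that `p[i]` is the pseudo-anomaly label and `s_a[i]` the predicted anomaly score of snippet `i`. The MIL loss is treated as an already computed scalar.

```python
import numpy as np


def multi_prompt_loss(f_anc, f_pos, f_neg, alpha=1.0):
    """Triplet-style hinge: max(||anc - pos||^2 - ||anc - neg||^2 + alpha, 0).

    alpha is the margin; its default here is an illustrative assumption.
    """
    d_pos = np.sum((f_anc - f_pos) ** 2)  # squared distance anchor -> positive
    d_neg = np.sum((f_anc - f_neg) ** 2)  # squared distance anchor -> negative
    return float(max(d_pos - d_neg + alpha, 0.0))


def pseudo_anomaly_loss(p, s_a, eps=1e-7):
    """Binary cross-entropy summed over the N snippets.

    p   : pseudo labels in {0, 1} (assumed meaning of p[i])
    s_a : predicted anomaly scores in (0, 1) (assumed meaning of s_a[i])
    eps : clipping constant added here only to avoid log(0)
    """
    s = np.clip(s_a, eps, 1.0 - eps)
    return float(-np.sum(p * np.log(s) + (1.0 - p) * np.log(1.0 - s)))


def lap_total_loss(l_mil, l_mpl, l_pal, beta=1.0, gamma=1.0):
    """Weighted sum L_MIL + beta * L_MPL + gamma * L_PAL.

    The beta and gamma defaults are placeholders, not values from the paper.
    """
    return l_mil + beta * l_mpl + gamma * l_pal


if __name__ == "__main__":
    anchor = np.array([0.0, 0.0])
    positive = np.array([1.0, 0.0])
    negative = np.array([0.0, 1.0])
    l_mpl = multi_prompt_loss(anchor, positive, negative, alpha=1.0)

    pseudo_labels = np.array([1.0, 0.0])
    scores = np.array([0.5, 0.5])
    l_pal = pseudo_anomaly_loss(pseudo_labels, scores)

    # l_mil = 0.5 is a stand-in value for the MIL loss.
    print(l_mpl, l_pal, lap_total_loss(0.5, l_mpl, l_pal, beta=0.1, gamma=0.2))
```

In this toy case the negative is no farther from the anchor than the positive, so the hinge stays active and the multi-prompt term is nonzero. Once the negative is pushed farther away than the positive by at least the margin, the term drops to zero.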

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection
