Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Chenchen Tao, Xiaohao Peng, Chong Wang, Jiafei Wu, Puning Zhao, Jun Wang, Jiangbo Qian
2024-09-03
Abstract: Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. However, the ambiguous nature of anomaly definitions across contexts may introduce inaccuracy in discriminating abnormal and normal events. To show the model what is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them could be calculated to identify the suspected events for each video snippet. It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos, as well as provides a new way to label pseudo anomalies for self-training. To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (86.5%, 90.4%, 94.4%, and 97.4%). Furthermore, it shows promising performance in open-set and cross-dataset cases. The data, code, and models can be found at: https://github.com/shiwoaz/lap.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper aims to solve several key problems in weakly supervised video anomaly detection (WS-VAD):

1. **Ambiguous anomaly definition**: Existing WS-VAD methods usually rely on multiple instance learning (MIL). When distinguishing between normal and abnormal segments, these methods often do not specify the type of anomaly. This ambiguity may lead to problems such as a high false-positive rate and low accuracy.
2. **Lack of semantic information**: Traditional MIL methods mainly focus on the visual modality and ignore the rich semantic information contained in text descriptions. This makes it difficult for the model to understand complex abnormal events, especially how they vary across scenarios.
3. **Insufficient self-supervised labels**: Existing methods lack an effective pseudo-label generation mechanism and cannot fully utilize unlabeled data for self-supervised training, which limits the performance of the model.

To solve these problems, the authors propose a new framework, **Learning Suspected Anomalies from Event Prompts (LAP)**. This framework combines text descriptions with video content by introducing an event-prompt dictionary that guides the model to identify abnormal events more accurately. Specifically, the main contributions of the LAP framework include:

- **Introducing text prompts**: Text prompts that describe abnormal events help the model better understand the specific manifestations of anomalies, thereby improving performance on open-set and cross-dataset problems.
- **Multi-prompt learning strategy**: A new multi-prompt learning strategy is proposed, enabling the model to learn normal and abnormal patterns across multiple videos rather than being limited to a single video.
- **Pseudo-abnormal label generation**: Based on the semantic similarity between event prompts and videos, additional pseudo-abnormal labels are mined for self-supervised training, further enhancing the detection ability of the model.

### Formula summary

1. **Feature synthesis**:
   \[ F_a = \theta(V_a, T_a), \quad F_n = \theta(V_n, T_n) \]
   where $\theta$ represents the feature alignment and fusion operation, which can be concatenation or addition.
2. **Multi-prompt learning loss**:
   \[ L_{\text{MPL}} = \max\left(\|f_{\text{anc}} - f_{\text{pos}}\|^2 - \|f_{\text{anc}} - f_{\text{neg}}\|^2 + \alpha,\ 0\right) \]
   where $\alpha$ is the margin coefficient.
3. **Pseudo-abnormal label loss**:
   \[ L_{\text{PAL}} = \sum_{i=1}^{N} -\left(p[i]\log(s_a[i]) + (1 - p[i])\log(1 - s_a[i])\right) \]
4. **Total loss function**:
   \[ L_{\text{LAP}} = L_{\text{MIL}} + \beta L_{\text{MPL}} + \gamma L_{\text{PAL}} \]
   where $\beta$ and $\gamma$ are hyperparameters.

With these components, the LAP framework outperforms most state-of-the-art methods on four benchmark datasets (XD-Violence, UCF-Crime, TAD, and ShanghaiTech) in terms of average precision (AP) or area under the curve (AUC).
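
The loss terms in the formula summary can be sketched in a few lines of plain Python. This is a minimal illustration of the three formulas as written above, not the authors' implementation: the function names, the default values of `alpha`, `beta`, and `gamma`, and the clamping constant `eps` are all assumptions made for the example, and $L_{\text{MIL}}$ is passed in as an already computed number because its form is not given here. How the anchor, positive, and negative features are selected is likewise not specified in this summary.

```python
import math


def triplet_loss(anchor, positive, negative, alpha=1.0):
    """L_MPL = max(||anc - pos||^2 - ||anc - neg||^2 + alpha, 0).

    Inputs are equal-length lists of floats; alpha is the margin
    (its default here is an assumption, not a value from the paper).
    """
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(d_pos - d_neg + alpha, 0.0)


def pseudo_label_loss(pseudo_labels, scores, eps=1e-7):
    """L_PAL: binary cross-entropy summed over N snippets.

    pseudo_labels[i] plays the role of p[i] and scores[i] the role of
    s_a[i]. Scores are clamped to (eps, 1 - eps) to avoid log(0); the
    clamp is a numerical safeguard added for this sketch.
    """
    total = 0.0
    for p, s in zip(pseudo_labels, scores):
        s = min(max(s, eps), 1.0 - eps)
        total -= p * math.log(s) + (1 - p) * math.log(1 - s)
    return total


def total_loss(l_mil, l_mpl, l_pal, beta=1.0, gamma=1.0):
    """L_LAP = L_MIL + beta * L_MPL + gamma * L_PAL."""
    return l_mil + beta * l_mpl + gamma * l_pal


if __name__ == "__main__":
    # Toy values chosen only to exercise the functions.
    l_mpl = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], alpha=0.5)
    l_pal = pseudo_label_loss([1, 0], [0.9, 0.2])
    print(total_loss(0.3, l_mpl, l_pal, beta=0.1, gamma=0.5))
```

Note that the triplet term is zero whenever the negative is at least `alpha` farther from the anchor than the positive (in squared distance), so only violating triplets contribute to the total.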