CLIP-Driven Multi-Scale Instance Learning for Weakly Supervised Video Anomaly Detection

Zhangbin Qian, Jiawei Tan, Zhilong Ou, Hongxing Wang
DOI: https://doi.org/10.1109/icme57554.2024.10687724
2024-01-01
Abstract: Existing weakly supervised video anomaly detection methods mainly employ Multiple Instance Learning (MIL) to identify abnormal snippets in untrimmed videos. However, the semantics and presentations of anomalies are frequently ambiguous, which MIL struggles to handle. Moreover, MIL suffers from false alarms because it optimizes each instance independently and neglects the temporal correlation between adjacent snippets. Consequently, there is a pressing need to better connect abnormal presentations with their semantics and to enable anomaly discovery at multiple temporal scales. This paper proposes a CLIP-Driven Multi-Scale Instance Learning (CMSIL) framework with two branches: Vision-Language (VL) and Multi-Scale Instance Learning (MSIL). The VL branch leverages the powerful visual concept priors of Contrastive Language-Image Pre-training (CLIP) to generate pseudo anomalies, thereby providing suspected anomaly cues that guide model training. The MSIL branch uses a feature pyramid to mine fine-grained temporal dependencies, applying MIL within each pyramid level to learn anomalous patterns across different temporal scales. Through the collaboration of the two branches, the proposed CMSIL handles anomalies of varying durations more effectively. Extensive experiments on the XD-Violence and UCF-Crime datasets demonstrate the superior performance of our method. The code is available at https://github.com/casperZB/CMSIL.
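The idea of applying MIL at several temporal scales can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes per-snippet anomaly scores are already available, builds coarser levels by average pooling those scores rather than features, and uses a top-k mean as the video-level MIL score. The function names, window sizes, `k`, and `margin` are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence


def pool_scores(scores: List[float], window: int) -> List[float]:
    """Average non-overlapping windows of snippet scores (one coarser level).

    Assumption: pooling scores stands in for pooling features in a pyramid.
    """
    pooled = []
    for start in range(0, len(scores), window):
        chunk = scores[start:start + window]
        pooled.append(sum(chunk) / len(chunk))
    return pooled


def topk_mean(scores: List[float], k: int) -> float:
    """Video-level MIL score: mean of the k highest instance scores."""
    k = max(1, min(k, len(scores)))
    return sum(sorted(scores, reverse=True)[:k]) / k


def multiscale_video_scores(
    scores: List[float], windows: Sequence[int] = (1, 2, 4), k: int = 2
) -> Dict[int, float]:
    """Return one top-k MIL video score per temporal scale (window size)."""
    return {w: topk_mean(pool_scores(scores, w), k) for w in windows}


def mil_ranking_loss(
    abnormal: Dict[int, float], normal: Dict[int, float], margin: float = 1.0
) -> float:
    """Hinge ranking loss summed over scales: abnormal videos should outscore
    normal ones by at least `margin` at every scale."""
    return sum(max(0.0, margin - abnormal[w] + normal[w]) for w in abnormal)


# Toy example: a short anomaly spanning snippets 1-2 of an 8-snippet video.
abnormal_video = [0.1, 0.9, 0.8, 0.1, 0.0, 0.2, 0.1, 0.0]
normal_video = [0.1, 0.0, 0.1, 0.2, 0.1, 0.0, 0.1, 0.0]

abnormal_scores = multiscale_video_scores(abnormal_video)
normal_scores = multiscale_video_scores(normal_video)
loss = mil_ranking_loss(abnormal_scores, normal_scores)
```

In this toy case the short anomaly gives its strongest video-level score at the finest scale, and the score falls as the window grows. That is the intuition behind scoring each scale separately: an anomaly lasting many snippets would remain prominent at the coarser levels, where isolated single-snippet spikes are averaged down.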