VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

Muchao Ye, Weiyang Liu, Pan He
2024-12-02
Abstract: The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.
Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper asks how to enable pretrained vision-language models (VLMs) to perform video anomaly detection (VAD) and produce interpretable results without modifying model parameters and without adding reasoning modules at inference time. Existing methods usually rely on extra reasoning modules or on instruction-tuning datasets, which incur substantial computational cost or data-annotation overhead. To address this, the authors propose a framework named VERA.

### Main Problems and Challenges

1. **Computational Costs and Data-Annotation Overheads**
   - Existing methods improve the reasoning capabilities of VLMs by adding reasoning modules or by training on instruction-tuning datasets, both of which lead to high computational cost or data-annotation overhead.
2. **Handling of Complex Reasoning Tasks**
   - VAD requires complex reasoning, and pretrained VLMs perform poorly on such tasks when used without additional tuning.
3. **Explanatory Requirements**
   - Beyond accurately detecting abnormal events, a VAD system also needs to give clear explanations that help users understand the detection results.

### Solution: VERA Framework

The VERA framework addresses these problems in the following ways:

- **Learned Guiding Questions**: VERA decomposes the complex reasoning required for VAD into a series of simpler, more targeted guiding questions. These questions are learned from coarsely labeled datasets in a data-driven manner, avoiding both manual design and fine-grained labeling.
- **No Parameter Modification**: VERA improves the reasoning capabilities of VLMs by optimizing the guiding questions, leaving the VLM parameters unchanged.
- **Efficient Inference**: At inference time, VERA uses the learned guiding questions to generate segment-level anomaly scores and refines them into frame-level scores by fusing scene and temporal contexts.

### Specific Steps

1. **Training Stage**
   - **Objective**: Learn guiding questions that decompose complex anomaly patterns.
   - **Method**: Iteratively optimize the guiding questions through the interaction between a learner VLM and an optimizer VLM. The learner generates predictions based on the current guiding questions, and the optimizer revises the questions according to those predictions to make them more effective.
2. **Inference Stage**
   - **Initial Anomaly Scoring**: Divide the video into segments and apply the learned guiding questions to each segment to generate an initial anomaly score.
   - **Fusing Scene Context**: Refine the segment-level anomaly scores by considering the relevance of adjacent segments.
   - **Fusing Temporal Context**: Apply Gaussian smoothing and position weighting to turn the segment-level scores into frame-level scores.

### Contributions

- Proposes VERA, presented as the first framework that adapts VLMs to the VAD task without instruction tuning or additional reasoning modules.
- Introduces a verbalized-learning algorithm that allows VLMs to be adapted directly to the VAD task.
- Designs a coarse-to-fine strategy that improves VAD performance and reasoning by fusing scene and temporal contexts.

Through these contributions, VERA improves both the detection performance and the interpretability of video anomaly detection, making the results easier for users to understand and trust.
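
The inference-stage refinement described above (segment-level scores, then scene context, then temporal context) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the neighbour-averaging rule in `fuse_adjacent`, the midpoint-centred form of `position_weights`, and the `radius`, `sigma`, and `sigma_frac` parameters are all hypothetical choices made for the example. The initial segment scores, which VERA obtains by prompting a VLM with the learned questions, are replaced here by hard-coded numbers.

```python
import math


def fuse_adjacent(scores, radius=1):
    """Average each segment score with its neighbours within `radius` segments.

    Assumption: a plain unweighted mean stands in for scene-context fusion.
    """
    fused = []
    for i in range(len(scores)):
        lo = max(0, i - radius)
        hi = min(len(scores), i + radius + 1)
        window = scores[lo:hi]
        fused.append(sum(window) / len(window))
    return fused


def gaussian_smooth(values, sigma=1.0):
    """Smooth a 1-D sequence with a Gaussian kernel, renormalised at the edges."""
    radius = max(1, int(3 * sigma))
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            w = math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
            total += w * values[j]
            weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


def position_weights(n, sigma_frac=0.5):
    """Hypothetical Gaussian weights over frame positions, peaking at the midpoint."""
    center = (n - 1) / 2
    sigma = max(sigma_frac * n, 1e-8)
    return [math.exp(-((t - center) ** 2) / (2 * sigma ** 2)) for t in range(n)]


def segment_to_frame_scores(segment_scores, frames_per_segment, radius=1, sigma=2.0):
    """Refine segment-level scores into frame-level scores (coarse to fine)."""
    fused = fuse_adjacent(segment_scores, radius)
    # Give every frame the score of the segment it belongs to.
    per_frame = [s for s in fused for _ in range(frames_per_segment)]
    smoothed = gaussian_smooth(per_frame, sigma)
    weights = position_weights(len(smoothed))
    return [s * w for s, w in zip(smoothed, weights)]


# Stand-in segment scores: the two middle segments look anomalous.
frame_scores = segment_to_frame_scores(
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0], frames_per_segment=4
)
```

With six segments of four frames each, `frame_scores` holds 24 values that rise smoothly toward the middle of the video instead of jumping at segment boundaries, which is the point of the coarse-to-fine step.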