Abstract: The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.

What problem does this paper attempt to address?

The paper asks how pretrained vision-language models (VLMs) can perform video anomaly detection (VAD) and explain their decisions without any change to model parameters and without additional reasoning modules at inference time. Existing methods usually rely on such modules or on instruction-tuning datasets, which incur substantial computational cost or data-annotation overhead. To address this, the authors propose a framework named VERA.

### Main Problems and Challenges

1. **Computational Costs and Data-Annotation Overheads**
   - Existing methods strengthen the reasoning of VLMs by adding reasoning modules or by training on instruction-tuning datasets, both of which are expensive in computation or annotation.
2. **Handling of Complex Reasoning Tasks**
   - VAD requires complex reasoning, and pretrained VLMs perform poorly on it when used without additional tuning.
3. **Explanatory Requirements**
   - Besides detecting abnormal events accurately, a VAD system must explain its decisions clearly so that users can understand the detection results.

### Solution: VERA Framework

The VERA framework addresses these problems as follows:

- **Learned Guiding Questions**: VERA decomposes the complex reasoning required for VAD into a set of simpler, more targeted guiding questions. The questions are learned in a data-driven manner from coarsely labeled data, which avoids both manual question design and fine-grained labeling.
- **No Parameter Modification**: VERA improves the reasoning of VLMs by optimizing the guiding questions only; the VLM parameters stay unchanged.
- **Efficient Inference**: At inference time, VERA uses the learned guiding questions to generate segment-level anomaly scores and refines them into frame-level scores by fusing scene and temporal contexts.

### Specific Steps

1. **Training Stage**
   - **Objective**: Learn guiding questions that decompose complex anomaly patterns.
   - **Method**: Iteratively optimize the guiding questions through interaction between a learner VLM and an optimizer VLM. The learner makes predictions using the current guiding questions, and the optimizer revises the questions based on those predictions to make them more effective.
2. **Inference Stage**
   - **Initial Anomaly Scoring**: Divide the video into segments and apply the learned guiding questions to each segment to obtain an initial anomaly score.
   - **Fusing Scene Context**: Refine each segment-level anomaly score by considering the relevance of adjacent segments.
   - **Fusing Temporal Context**: Apply Gaussian smoothing and position weighting to turn the segment-level scores into frame-level scores.

### Contributions

- Presents VERA, described as the first framework that adapts VLMs to the VAD task without instruction tuning or additional reasoning modules.
- Introduces a verbalized learning algorithm that adapts VLMs to the VAD task directly, without parameter updates.
- Designs a coarse-to-fine strategy that improves VAD performance and reasoning by fusing scene and temporal contexts.

With these contributions, VERA improves both the detection performance and the explainability of VLM-based video anomaly detection, making the results easier for users to understand and trust.
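
The "Fusing Temporal Context" step of the inference stage can be illustrated with a small sketch. The summary only says that Gaussian smoothing and position weighting turn segment-level scores into frame-level scores; everything else here is an assumption made for illustration: the function names `gaussian_smooth` and `refine_to_frame_scores`, the parameters `smooth_sigma`, `position_sigma`, and `center`, the fixed number of frames per segment, and the choice to center the position weight on the middle of the video by default. It is not the authors' implementation.

```python
import math
from typing import List, Optional


def gaussian_smooth(scores: List[float], sigma: float = 1.0) -> List[float]:
    """Smooth segment-level scores with a normalized Gaussian kernel.

    Each output is a weighted average of all segment scores, with weights
    that decay with the distance between segment indices.
    """
    n = len(scores)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-((i - j) ** 2) / (2 * sigma ** 2)) for j in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * s for w, s in zip(weights, scores)) / total)
    return smoothed


def refine_to_frame_scores(
    segment_scores: List[float],
    frames_per_segment: int,
    smooth_sigma: float = 1.0,
    position_sigma: Optional[float] = None,
    center: Optional[float] = None,
) -> List[float]:
    """Turn segment-level anomaly scores into frame-level scores.

    Assumptions (not taken from the source): every segment covers the same
    number of frames, and the position weight is a Gaussian over the frame
    index, centered at `center` (default: the middle of the video).
    """
    smoothed = gaussian_smooth(segment_scores, smooth_sigma)
    # Give every frame the smoothed score of the segment it belongs to.
    per_frame = [s for s in smoothed for _ in range(frames_per_segment)]
    n = len(per_frame)
    if n == 0:
        return []
    if center is None:
        center = (n - 1) / 2
    if position_sigma is None:
        position_sigma = n / 2
    # Scale each frame score by its Gaussian position weight.
    return [
        s * math.exp(-((f - center) ** 2) / (2 * position_sigma ** 2))
        for f, s in enumerate(per_frame)
    ]


if __name__ == "__main__":
    # Five segments of four frames each; the third segment looks anomalous.
    frame_scores = refine_to_frame_scores([0.1, 0.1, 0.9, 0.1, 0.1], frames_per_segment=4)
    print(len(frame_scores), [round(s, 3) for s in frame_scores])
```

Because the smoothing kernel is normalized, scores in [0, 1] stay in [0, 1], and the position weight can only scale them down. In practice the two sigma values would be tuned on validation data.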

VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
