Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection

Xiaofeng Tan, Hongsong Wang, Xin Geng
2024-12-04
Abstract: Video anomaly detection is an essential yet challenging open-set task in computer vision, often addressed by leveraging reconstruction as a proxy task. However, existing reconstruction-based methods encounter challenges in two main aspects: (1) limited model robustness for open-set scenarios, (2) and an overemphasis on, but restricted capacity for, detailed motion reconstruction. To this end, we propose a novel frequency-guided diffusion model with perturbation training, which enhances the model robustness by perturbation training and emphasizes the principal motion components guided by motion frequencies. Specifically, we first use a trainable generator to produce perturbative samples for perturbation training of the diffusion model. During the perturbation training phase, the model robustness is enhanced and the domain of the reconstructed model is broadened by training against this generator. Subsequently, perturbative samples are introduced for inference, which impacts the reconstruction of normal and abnormal motions differentially, thereby enhancing their separability. Considering that motion details originate from high-frequency information, we propose a masking method based on 2D discrete cosine transform to separate high-frequency information and low-frequency information. Guided by the high-frequency information from observed motion, the diffusion model can focus on generating low-frequency information, and thus reconstructing the motion accurately. Experimental results on five video anomaly detection datasets, including human-related and open-set benchmarks, demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/Xiaofeng-Tan/FGDMAD-Code.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main problems in video anomaly detection (VAD):

1. **Insufficient model robustness**: Existing reconstruction-based methods perform poorly in open-set scenarios, mainly because the models are not robust to unseen normal samples. These methods usually train with identical inputs and outputs to learn normal patterns, which can lead the model to learn shortcuts. When a normal motion is perturbed, the model struggles to reconstruct it with the learned shortcuts, and the motion is misclassified as abnormal.
2. **Overemphasis on detailed motion reconstruction**: Existing methods do not distinguish between the principal components of a motion and its details. From a signal-processing perspective, the principal components correspond to low-frequency information and the details to high-frequency information. Generating an approximate motion is relatively easy, but accurately reconstructing its details is very difficult, because differences in personal habits cause the high-frequency information to vary.

To address these problems, the authors propose a **Frequency-Guided Diffusion Model with Perturbation Training**. Its main contributions are:

- **Enhancing model robustness through perturbation training**: A trainable Perturbative Example Generator produces perturbed samples for perturbation training. Through adversarial training against this generator, the diffusion model becomes robust to perturbed normal motion, and the separability between normal and abnormal events improves.
- **Frequency-guided denoising process**: A two-dimensional discrete cosine transform (2D DCT) decomposes motion into high-frequency and low-frequency information. The observed high-frequency information guides the diffusion model so that it can focus on generating low-frequency information and thereby reconstruct the motion more accurately.
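The frequency split behind the second contribution can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a motion clip stored as a frames-by-channels matrix, applies a separable orthonormal DCT-II along both axes, and uses a hypothetical low-pass mask that keeps coefficients whose index sum is below `cutoff`. The helper names (`dct2`, `split_frequencies`, `fuse`) and the mask shape are assumptions made for this example.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix D (n x n); its transpose is its inverse."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(x):
    """2D DCT of a matrix: Y = D_rows X D_cols^T."""
    d_r, d_c = dct_matrix(len(x)), dct_matrix(len(x[0]))
    return matmul(matmul(d_r, x), transpose(d_c))

def idct2(y):
    """Inverse 2D DCT: X = D_rows^T Y D_cols."""
    d_r, d_c = dct_matrix(len(y)), dct_matrix(len(y[0]))
    return matmul(matmul(transpose(d_r), y), d_c)

def split_frequencies(y, cutoff):
    """Split coefficients with complementary binary masks (illustrative mask:
    'low' keeps entries whose row index + column index is below cutoff)."""
    low = [[val if (u + v) < cutoff else 0.0 for v, val in enumerate(row)]
           for u, row in enumerate(y)]
    high = [[val - lo for val, lo in zip(row, lo_row)]
            for row, lo_row in zip(y, low)]
    return low, high

def fuse(y_obs, y_gen, cutoff):
    """Combine observed high frequencies with generated low frequencies."""
    _, obs_high = split_frequencies(y_obs, cutoff)
    gen_low, _ = split_frequencies(y_gen, cutoff)
    return [[h + l for h, l in zip(h_row, l_row)]
            for h_row, l_row in zip(obs_high, gen_low)]

# Toy motion: 8 frames x 4 channels, a smooth wave plus a small frame-to-frame jitter.
T, C = 8, 4
x = [[math.sin(2 * math.pi * t / T + c) + 0.05 * (-1) ** t for c in range(C)]
     for t in range(T)]
y = dct2(x)
y_low, y_high = split_frequencies(y, cutoff=3)
x_low, x_high = idct2(y_low), idct2(y_high)  # coarse motion and fine detail
```

Because the two masks are complementary and the transform is linear, `x_low` and `x_high` add back up to `x`. In the method described above, `y_gen` would come from the diffusion model's current estimate instead of from the observation.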
Specifically, the workflow of this method includes:

1. **Perturbation generation**: The perturbation generator produces perturbed samples that expand the learning domain of the model.
2. **Perturbation training**: The perturbation generator and the noise predictor are optimized alternately so that the model can handle perturbed samples.
3. **Frequency information extraction and fusion**: In the inference stage, the low-frequency information of the generated motion is fused with the observed high-frequency information to improve reconstruction quality.

Experimental results show that this method significantly outperforms existing state-of-the-art methods on five public VAD datasets, especially on open-set benchmarks.

### Formula summary

- Perturbation generation:
  \[ \delta = \lambda \cdot \text{sign}(\nabla_x L(x, \theta)) \]
  \[ \hat{x} = x + \delta \]
- Noise addition process:
  \[ x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \]
- Noise prediction loss:
  \[ L(x, \theta) = \mathbb{E}_{x,t}\left[ \| \epsilon - \epsilon_\theta(x_t, t, c) \|_2^2 \right] \]
- Frequency information extraction:
  \[ y = \text{DCT}(\bar{x}) = D\bar{x} \]
  \[ \bar{x} = \text{iDCT}(y) = D^T y \]
- Frequency information fusion:
  \[ y_c^t = y_o^t \odot M_h(y_o^t) + y_g^t \odot M_l(y_g^t) \]
  \[ \bar{x}_c^t = \text{iDCT}(y_c^t) \]

Through these improvements, the method achieves better performance on the video anomaly detection task, especially in open-set scenarios.
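To make the first three formulas concrete, here is a small worked sketch. It is not the paper's trainable generator or its alternating optimization: the noise predictor is replaced by a hypothetical one-parameter model (`toy_noise_predictor`, a single scalar weight `w`) so that the gradient can be written by hand, and the gradient is taken with respect to the input `x` so that the perturbation has the same shape as `x`. All names and numeric values are assumptions chosen for illustration.

```python
import math

def add_noise(x, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * xi + b * ei for xi, ei in zip(x, eps)]

def toy_noise_predictor(x_t, w):
    """Hypothetical stand-in for eps_theta(x_t, t, c): one scalar weight w."""
    return [w * v for v in x_t]

def noise_loss(x, eps, alpha_bar_t, w):
    """Squared error between the true noise and the predicted noise."""
    x_t = add_noise(x, eps, alpha_bar_t)
    pred = toy_noise_predictor(x_t, w)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

def loss_grad_wrt_x(x, eps, alpha_bar_t, w):
    """Hand-derived gradient of noise_loss with respect to the clean input x."""
    a = math.sqrt(alpha_bar_t)
    x_t = add_noise(x, eps, alpha_bar_t)
    return [2.0 * (w * v - e) * w * a for v, e in zip(x_t, eps)]

def sign(v):
    return (v > 0) - (v < 0)

def perturb(x, eps, alpha_bar_t, w, lam):
    """x_hat = x + lam * sign(gradient): a one-step, loss-increasing perturbation."""
    grad = loss_grad_wrt_x(x, eps, alpha_bar_t, w)
    return [xi + lam * sign(g) for xi, g in zip(x, grad)]

# Illustrative values only.
x = [0.5, -1.0, 0.25, 2.0]
eps = [0.1, -0.3, 0.7, -0.2]
alpha_bar_t, w, lam = 0.64, 0.5, 0.05

x_hat = perturb(x, eps, alpha_bar_t, w, lam)
loss_clean = noise_loss(x, eps, alpha_bar_t, w)
loss_perturbed = noise_loss(x_hat, eps, alpha_bar_t, w)
```

For this toy model the loss is convex in `x`, so stepping along the sign of the gradient cannot lower it. Samples that are harder to denoise are the kind of input that perturbation training exposes the model to.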