CViT: A Convolution Vision Transformer for Video Abnormal Behavior Detection and Localization

Sanjay Roka, Manoj Diwakar
DOI: https://doi.org/10.1007/s42979-023-02294-y
2023-10-29
SN Computer Science
Abstract: Video anomaly detection is a critical task because abnormal events are rare, irregular, and unbounded. Most current approaches rely solely on CNNs, but because of their spatial inductive bias, CNNs extract only local features from images, which is insufficient for video anomaly detection. Recently, transformer-based approaches have become popular thanks to their global self-attention mechanism and are considered alternatives to CNN convolution for sequence-to-sequence anomaly detection. Unfortunately, because they lack adequate low-level information, transformers have limited localization ability. In this paper, we propose a new approach built on the CViT block. We design it by fusing U-Net with a transformer and modifying the encoder by stacking CViT blocks one after another. This combination allows our model to extract richer local and global features from RGB frames. Our approach contains two modules. The anomaly detection module detects abnormal frames using PSNR and an anomaly score. The anomaly localization module accepts only the list of abnormal frames and uses the object detection algorithm YOLO to highlight abnormal objects. We first evaluated our approach on our own custom dataset, GEU, and for comparison we use the standard benchmark datasets UCSD, CUHK Avenue, and ShanghaiTech. Comparative results show that our approach performs better at detecting abnormal events.
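The abstract names PSNR and an anomaly score as the basis for flagging abnormal frames but does not give the formulas. The sketch below shows one common way such a score is computed in reconstruction- and prediction-based video anomaly detection: PSNR between a model's output frame and the real frame, min-max normalised over a video and inverted so that poorly reconstructed frames score high. The function names, the flattened-pixel frame representation, the `max_value` of 1.0, and the 0.5 threshold are all illustrative assumptions, not details taken from the paper.

```python
import math


def psnr(predicted, actual, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel intensities."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    # Floor the error so identical frames give a large finite PSNR, not infinity.
    mse = max(mse, 1e-10)
    return 10.0 * math.log10(max_value ** 2 / mse)


def anomaly_scores(psnr_values):
    """Min-max normalise PSNR over one video and invert it.

    Assumed convention: the frame with the lowest PSNR gets score 1.0,
    the frame with the highest PSNR gets score 0.0.
    """
    lo, hi = min(psnr_values), max(psnr_values)
    if hi == lo:
        return [0.0 for _ in psnr_values]
    return [1.0 - (v - lo) / (hi - lo) for v in psnr_values]


def abnormal_frame_indices(scores, threshold=0.5):
    """Indices of frames whose score is strictly above the (assumed) threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]
```

Under these assumptions, the indices returned by `abnormal_frame_indices` would play the role of the list of abnormal frames that the localization module receives; the object detection step itself is not sketched here.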