Video Anomaly Detection by Fusing Self-Attention and Autoencoder
Liang Jiafei, Li Ting, Yang Jiaqi, Li Yanan, Fang Zhiwen, Yang Feng
DOI: https://doi.org/10.11834/jig.211147
2023-01-01
Abstract: Objective Video anomaly detection is widely studied in video surveillance. It aims to detect and extract irregular motions, and it is also relevant to remote rehabilitation motion analysis. However, it is difficult to obtain training samples that cover all types of abnormal events. Existing video anomaly detection methods therefore usually train a model on datasets that contain only normal samples; in the testing phase, events whose patterns differ from the normal patterns are detected as anomalies. To represent normal motion patterns in videos, early works relied on hand-crafted features and focused on low-level trajectory features. However, effective trajectory features are difficult to obtain in complicated scenes. Spatial-temporal features such as the histogram of oriented flows (HOF) and the histogram of oriented gradients (HOG) are commonly used to represent motion and appearance in anomaly detection. Based on these spatial-temporal features, the Markov random field (MRF), the mixture of probabilistic PCA (MPPCA), and the Gaussian mixture model have been employed to model motion and appearance patterns. Under the assumption that normal patterns can be represented as linear combinations of dictionary elements, sparse coding and dictionary learning can be used to encode normal patterns. Because hand-crafted features have limited descriptive power, these models remain poorly robust across multiple scenarios. More recently, autoencoder-based deep learning methods have been introduced into video anomaly detection. A 3D convolutional autoencoder was designed to model normal patterns in regular frames. A convolutional long short-term memory (LSTM) autoencoder, which combines a convolutional neural network (CNN) with an LSTM, was developed to model normal appearance and motion patterns simultaneously.
Motivated by the strong performance of sparse coding-based anomaly detection, an adaptive iterative hard-thresholding algorithm was designed within an LSTM framework to learn the sparse representation and dictionary of normal patterns. In contrast to reconstruction-based models, autoencoder-based prediction networks have also been introduced into anomaly detection; they detect anomalies by computing the error between predicted frames and ground-truth frames. In addition, a multipath frame prediction network based on the convolutional gated recurrent unit (ConvGRU) was proposed to process spatial-temporal information at different scales. Owing to the blindness of self-supervised learning in anomaly detection, CNN-based methods are limited in mining normal patterns. The vision transformer (ViT) extends the Transformer from natural language processing to the image domain and improves feature representation, and a CNN can be integrated with a Transformer to learn global context information. We therefore develop an anomaly detection method based on the Transformer and U-Net.

Method In this study, a Transformer is embedded in a plain U-Net to learn local and global spatial-temporal information of normal events. First, an encoder extracts spatial-temporal features from consecutive frames. The final encoder features are then fed into the Transformer, which encodes global information and learns the relations between feature pixels. Next, a decoder upsamples the Transformer features and merges them, via skip connections, with the low-level encoder features of the same resolution. In this way, the whole network combines global spatial-temporal information with local detail information. The convolution and deconvolution kernels are 3 × 3, and the max-pooling kernel is 2 × 2.
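To illustrate what "learning the relations between feature pixels" means, the following is a minimal sketch of single-head scaled dot-product self-attention over a flattened feature map. It is not the authors' implementation: the helper names (`flatten_feature_map`, `self_attention`) are hypothetical, and the sketch assumes identity query/key/value projections, one head, and no positional embedding, whereas a real Transformer block learns its projections.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_feature_map(fmap: List[List[Vector]]) -> List[Vector]:
    """Turn an H x W x C feature map into H*W tokens of length C."""
    return [pixel for row in fmap for pixel in row]


def self_attention(tokens: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Single-head self-attention with identity Q/K/V projections.

    Assumption: learned projections are omitted so that the pairwise
    weighting between feature pixels is easy to follow.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this feature pixel to every feature pixel.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of all feature pixels (global context).
        outputs.append([sum(wj * v[c] for wj, v in zip(w, tokens)) for c in range(d)])
    return outputs, weights


# A toy 2 x 2 feature map with 2 channels (illustrative values only).
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [0.0, 0.0]]]
tokens = flatten_feature_map(fmap)
out, attn = self_attention(tokens)
print(len(out), len(out[0]))  # number of tokens, channels per token
```

Every output token mixes information from all spatial positions, which is the global context that a convolution with a 3 × 3 kernel cannot provide in a single layer.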
The encoder and the decoder each have four layers. To make predicted frames close to their ground truth, we minimize the intensity and gradient distances between them. Because existing anomaly detection datasets are captured in outdoor settings at long range, we collected an indoor motion dataset of hand movements from published datasets to meet the requirements of anomaly detection for close-range rehabilitation movements. For periodic hand movements, in addition to the traditional reconstruction loss, we introduce a dynamic image constraint that further guides the network to focus on the periodic close-range motion area.

Result We compare the proposed approach with several anomaly detection methods on four outdoor public datasets and one indoor dataset. The frame-level area under the curve (AUC) improves by 1.0%, 0.4%, and 1.1% on Avenue, Ped1, and Ped2, respectively. The method effectively detects abnormal events in the low-resolution Ped1 and Ped2 videos. On the LV dataset, it achieves an AUC of 65.1%. Because the Transformer-based network captures richer feature information through the self-attention mechanism, it can mine diverse normal patterns across multiple scenes and improve detection performance. On the collected indoor dataset, the performance on the four actions, denoted A1-1, A1-2, A1-3, and A1-4, reaches 60.3%, 63.4%, 67.7%, and 64.4%, respectively. To verify the effectiveness of the Transformer module and the dynamic image constraint, we conduct ablation experiments in which each is removed during training. The results show that the Transformer module improves anomaly detection performance. With the dynamic image constraint, the performance on the four indoor actions improves by 0.6%, 2.4%, 1.1%, and 0.9%, respectively.
This indicates that the dynamic image loss leads the network to attend to the foreground motion area.

Conclusion We develop a video anomaly detection method based on the Transformer and U-Net. An indoor motion dataset is collected for analyzing abnormalities in close-range indoor rehabilitation movements. Experimental results show that the method has the potential to effectively detect abnormal behaviors in both indoor and outdoor videos.
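The training objective mentioned in the Method part, which penalizes the intensity and gradient distances between a predicted frame and its ground truth, can be sketched as follows. The abstract does not state the exact norms or weights, so this sketch assumes a mean squared intensity difference, a mean absolute difference between absolute image gradients, and equal weights; the function names and the parameters `w_int` and `w_grad` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def intensity_loss(pred: Frame, gt: Frame) -> float:
    """Mean squared intensity difference (assumed L2 form)."""
    diffs = [(p - g) ** 2 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)]
    return sum(diffs) / len(diffs)


def abs_gradients(frame: Frame) -> Tuple[Frame, Frame]:
    """Absolute forward differences along x (columns) and y (rows)."""
    gx = [[abs(row[j + 1] - row[j]) for j in range(len(row) - 1)] for row in frame]
    gy = [[abs(frame[i + 1][j] - frame[i][j]) for j in range(len(frame[0]))]
          for i in range(len(frame) - 1)]
    return gx, gy


def gradient_loss(pred: Frame, gt: Frame) -> float:
    """Mean absolute difference between gradient maps (assumed L1 form)."""
    pgx, pgy = abs_gradients(pred)
    ggx, ggy = abs_gradients(gt)
    terms = [abs(a - b)
             for p_map, g_map in ((pgx, ggx), (pgy, ggy))
             for p_row, g_row in zip(p_map, g_map)
             for a, b in zip(p_row, g_row)]
    return sum(terms) / len(terms)


def prediction_loss(pred: Frame, gt: Frame,
                    w_int: float = 1.0, w_grad: float = 1.0) -> float:
    """Weighted sum of the two distances; the weights are illustrative."""
    return w_int * intensity_loss(pred, gt) + w_grad * gradient_loss(pred, gt)


# Ground truth with one bright pixel, and a blurred prediction of it.
gt = [[0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0]]
blurred = [[1.0 / 9.0] * 3 for _ in range(3)]

print(prediction_loss(gt, gt))       # 0.0 for a perfect prediction
print(prediction_loss(blurred, gt))  # larger: intensity and edges both differ
```

The gradient term penalizes the blurred prediction for losing the edges around the bright pixel even though its mean intensity matches, which is why frame-prediction losses commonly pair it with an intensity term.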