Temporal Divide-and-Conquer Anomaly Actions Localization in Semi-Supervised Videos with Hierarchical Transformer

Nada Osman, Marwan Torki
2024-08-25
Abstract: Anomaly action detection and localization play an essential role in security and advanced surveillance systems. However, due to the tremendous amount of surveillance videos, most of the available data for the task is unlabeled or semi-labeled with the video class known, but the location of the anomaly event is unknown. In this work, we target anomaly localization in semi-supervised videos. While the mainstream direction in addressing this task is focused on segment-level multi-instance learning and the generation of pseudo labels, we aim to explore a promising yet unfulfilled direction to solve the problem by learning the temporal relations within videos in order to locate anomaly events. To this end, we propose a hierarchical transformer model designed to evaluate the significance of observed actions in anomalous videos with a divide-and-conquer strategy along the temporal axis. Our approach segments a parent video hierarchically into multiple temporal children instances and measures the influence of the children nodes in classifying the abnormality of the parent video. Evaluating our model on two well-known anomaly detection datasets, UCF-Crime and ShanghaiTech, proves its ability to interpret the observed actions within videos and localize the anomalous ones. Our proposed approach outperforms previous works relying on segment-level multiple-instance learning approaches while reaching a promising performance compared to the more recent pseudo-labeling-based approaches.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses **localizing abnormal actions in semi-supervised videos**. Surveillance footage is produced in huge volumes, and most of it is unlabeled or carries only video-level labels: it is known whether a video is normal or abnormal, but not where the abnormal event occurs. The research problem is how to accurately detect and localize abnormal actions in such videos.

### Problem Background

1. **Importance of Surveillance Videos**: Surveillance is at the core of almost all security systems, but extracting important events from the footage, especially abnormal ones, is time-consuming and tedious.
2. **Lack of Supervised Data**: Most of the available data is unsupervised or weakly supervised, so detailed labels are missing when the model is trained.
3. **Limitations of Existing Methods**:
   - **Multi-Instance Learning (MIL)**: Many works focus on multi-instance learning at the segment level and improve model performance by generating pseudo-labels.
   - **Pseudo-Label Generation**: Some methods train the model on generated pseudo-labels, but this depends on the specific dataset and may introduce noise.
   - **Image Reconstruction**: Other methods train the model to reconstruct normal videos or frames, then use the reconstruction quality to distinguish normal frames from abnormal ones.

### New Method Proposed in the Paper

To address these problems, the authors propose a different direction: locating abnormal events by learning the temporal relationships within videos. The method has four parts:

1. **Divide-and-Conquer Strategy**: Hierarchically segments the video into multiple temporal sub-segments and evaluates the degree of abnormality of each sub-segment with a hierarchical Transformer model.
2. **Dual-Scale Feature Extractor (DS-Φ)**: Increases the temporal resolution so that the model can process video segments at a finer granularity.
3. **Hierarchical Transformer Layers**: Gradually refine the classification of video segments through multi-level self-attention.
4. **Prediction Head**: Combines the class activation map (CAM) and attention weights to produce the final abnormality classification.

### Main Contributions

1. **Task Reformulation**: Recasts segment-level anomaly detection as a video-level anomaly localization task, so that the learning process can benefit from video classification.
2. **Divide-and-Conquer Model**: Proposes a Transformer model based on a temporal divide-and-conquer strategy that evaluates the degree of abnormality at different temporal scales.
3. **Experimental Verification**: Evaluates the model on two well-known datasets (UCF-Crime and ShanghaiTech) and runs ablation experiments to verify the effectiveness of each component.

### Formula Summary

The main formulas in the paper are:

- Feature Mapping: \(\Phi(S)=\text{MLP}(\text{Feat}(S))\) (Formula 1)
- Dual-Scale Feature Extraction: \(x_{2i-1}=\Phi(S_i)+\Phi(S^1_i)\), \(x_{2i}=\Phi(S_i)+\Phi(S^2_i)\) (Formula 2)
- Hierarchical Transformer Output: \(c^j_k, w^j_k, h^j_k = \text{TL}(\text{Cls}\oplus h^{\lceil j/2\rceil}_{k-1})\) (Formula 3)
- Abnormal Classification: \((y^j_k)^c=\text{Sigmoid}(\text{MLP}(c^j_k))\) and \((y^j_k)^h=\text{Sigmoid}(\text{SLP}(\text{AveragePooling}(h^j_k)))\) (Formulas 4 and 5)
- Final Prediction: \(y^j_k=\text{Average}\big((y^j_k)^c, (y^j_k)^h\big)\) (Formula 6)

Together, these components give a more general-purpose approach that detects and localizes abnormal events without relying on pseudo-labels.
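
The index bookkeeping in Formulas 2, 3, and 6 can be illustrated with a small sketch. This is not the authors' implementation. It uses plain Python lists in place of learned features, and it assumes that each node is split into two equal temporal halves and that \(S^1_i\) and \(S^2_i\) are the two sub-segments of \(S_i\). The function names (`dual_scale_tokens`, `build_levels`, `parent_index`, `final_prediction`) are hypothetical, and the logits passed to `final_prediction` stand in for the MLP and SLP outputs.

```python
import math


def dual_scale_tokens(phi_full, phi_sub1, phi_sub2):
    """Formula 2: x_{2i-1} = Phi(S_i) + Phi(S_i^1), x_{2i} = Phi(S_i) + Phi(S_i^2).

    Each argument is a list of feature vectors (lists of floats), one per
    segment S_i. The output has twice as many tokens as there are segments.
    """
    tokens = []
    for full, sub1, sub2 in zip(phi_full, phi_sub1, phi_sub2):
        tokens.append([a + b for a, b in zip(full, sub1)])  # x_{2i-1}
        tokens.append([a + b for a, b in zip(full, sub2)])  # x_{2i}
    return tokens


def build_levels(num_tokens, depth):
    """Token range (start, end) of every node, level by level.

    Assumption: each node is halved along the temporal axis, so level k
    holds 2**k nodes. Level 0 is the whole (parent) video.
    """
    levels = [[(0, num_tokens)]]
    for _ in range(depth):
        children = []
        for start, end in levels[-1]:
            mid = (start + end) // 2
            children.extend([(start, mid), (mid, end)])
        levels.append(children)
    return levels


def parent_index(j):
    """1-based index of the parent of node j at the previous level: ceil(j / 2)."""
    return (j + 1) // 2


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def final_prediction(cls_logit, pooled_logit):
    """Formulas 4-6, with the MLP and SLP outputs supplied as plain logits."""
    y_c = sigmoid(cls_logit)     # score from the class token c
    y_h = sigmoid(pooled_logit)  # score from the average-pooled hidden tokens h
    return (y_c + y_h) / 2.0     # average of the two scores


if __name__ == "__main__":
    levels = build_levels(num_tokens=8, depth=2)
    for k, nodes in enumerate(levels):
        print(f"level {k}: {nodes}")
    print("parent of node 3:", parent_index(3))
```

Under these assumptions, nodes 1 and 2 at level \(k\) read from parent 1 at level \(k-1\), and nodes 3 and 4 read from parent 2, which matches the \(\lceil j/2\rceil\) superscript in Formula 3.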