Abstract: Anomaly action detection and localization play an essential role in security and advanced surveillance systems. However, due to the tremendous amount of surveillance videos, most of the available data for the task is unlabeled or semi-labeled with the video class known, but the location of the anomaly event is unknown. In this work, we target anomaly localization in semi-supervised videos. While the mainstream direction in addressing this task is focused on segment-level multi-instance learning and the generation of pseudo labels, we aim to explore a promising yet unfulfilled direction to solve the problem by learning the temporal relations within videos in order to locate anomaly events. To this end, we propose a hierarchical transformer model designed to evaluate the significance of observed actions in anomalous videos with a divide-and-conquer strategy along the temporal axis. Our approach segments a parent video hierarchically into multiple temporal children instances and measures the influence of the children nodes in classifying the abnormality of the parent video. Evaluating our model on two well-known anomaly detection datasets, UCF-Crime and ShanghaiTech, proves its ability to interpret the observed actions within videos and localize the anomalous ones. Our proposed approach outperforms previous works relying on segment-level multiple-instance learning approaches while reaching a promising performance compared to the more recent pseudo-labeling-based approaches.

What problem does this paper attempt to address?

The problem this paper attempts to solve is **localizing anomalous actions in semi-supervised videos**. Surveillance video data is produced in huge volumes, and most of it is unlabeled or carries only video-level labels: it is known whether a video is normal or abnormal, but not where in the video the abnormal event occurs. Accurately detecting and localizing abnormal behaviors in such videos is therefore an important research problem.

### Problem Background

1. **Importance of Surveillance Videos**: Surveillance is at the core of almost all security systems, but extracting important events from the footage, especially abnormal events, is time-consuming and cumbersome.
2. **Lack of Supervised Data**: Most of the available data is unsupervised or weakly supervised, so detailed labels are missing when training the model.
3. **Limitations of Existing Methods**:
   - **Multi-Instance Learning (MIL)**: Many works focus on multi-instance learning at the segment level and improve model performance by generating pseudo-labels.
   - **Pseudo-Label Generation**: Some methods train the model on generated pseudo-labels, but this depends on the specific dataset and may introduce noise.
   - **Image Reconstruction**: Other methods train the model to reconstruct normal videos or frames and then use the reconstruction quality to distinguish normal frames from abnormal ones.

### New Method Proposed in the Paper

To address these problems, the authors propose a new direction: localizing abnormal events by learning the temporal relationships within videos. The method consists of:

1. **Divide-and-Conquer Strategy**: Hierarchically segment the video into multiple temporal sub-segments, and evaluate the degree of abnormality of each sub-segment with a hierarchical Transformer model.
2. **Dual-Scale Feature Extractor (DS-Φ)**: Increase the temporal resolution so that the model can process video segments at a finer granularity.
3. **Hierarchical Transformer Layers**: Gradually refine the classification of video segments through multi-level self-attention.
4. **Prediction Head**: Combine the class activation map (CAM) and attention weights to generate the final anomaly classification result.

### Main Contributions

1. **Task Reformulation**: Recast the segment-level anomaly detection task as a video-level anomaly localization task, so that the learning process can benefit from video classification.
2. **Divide-and-Conquer Model**: Propose a Transformer model based on a temporal divide-and-conquer strategy that can evaluate the degree of abnormality at different temporal scales.
3. **Experimental Verification**: Evaluate on two well-known datasets (UCF-Crime and ShanghaiTech), and conduct ablation experiments to verify the effectiveness of each component.

### Formula Summary

The formulas involved in the paper mainly include:

- Feature Mapping: \(\Phi(S)=\text{MLP}(\text{Feat}(S))\) (Formula 1)
- Dual-Scale Feature Extraction: \(x_{2i-1}=\Phi(S_i)+\Phi(S^1_i)\), \(x_{2i}=\Phi(S_i)+\Phi(S^2_i)\) (Formula 2)
- Hierarchical Transformer Output: \(c^j_k, w^j_k, h^j_k = \text{TL}(Cls\oplus h^{\lceil j/2\rceil}_{k-1})\) (Formula 3)
- Anomaly Classification: \((y^j_k)^c=\text{Sigmoid}(\text{MLP}(c^j_k))\) and \((y^j_k)^h=\text{Sigmoid}(\text{SLP}(\text{AveragePooling}(h^j_k)))\) (Formulas 4 and 5)
- Final Prediction: \(y^j_k=\text{Average}((y^j_k)^c+(y^j_k)^h)\) (Formula 6)

With these components, the paper offers a new and more general-purpose solution that can detect and localize abnormal events without relying on pseudo-labels.
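The indexing in Formulas 2, 3, and 6 can be made concrete with a small sketch. This is not the authors' implementation: the function names are hypothetical, scalars stand in for the feature vectors \(\Phi(\cdot)\), the binary split of each parent range is an assumption suggested by the \(\lceil j/2\rceil\) parent index, and Formula 6 is read here as the mean of the two head outputs.

```python
from math import ceil
from typing import List, Tuple


def dual_scale_tokens(coarse: List[float],
                      fine: List[Tuple[float, float]]) -> List[float]:
    """Formula 2 with scalar stand-ins for Phi(S_i), Phi(S_i^1), Phi(S_i^2).

    Each coarse segment i yields two tokens, one per fine sub-segment,
    so the token sequence is twice as long as the segment sequence.
    """
    tokens = []
    for phi_s, (phi_s1, phi_s2) in zip(coarse, fine):
        tokens.append(phi_s + phi_s1)  # x_{2i-1}
        tokens.append(phi_s + phi_s2)  # x_{2i}
    return tokens


def split_levels(num_tokens: int,
                 num_levels: int) -> List[List[Tuple[int, int]]]:
    """Assumed binary temporal split: level k holds 2**k half-open ranges."""
    levels = [[(0, num_tokens)]]
    for _ in range(num_levels):
        children = []
        for start, end in levels[-1]:
            mid = (start + end) // 2
            children.extend([(start, mid), (mid, end)])
        levels.append(children)
    return levels


def parent_index(j: int) -> int:
    """1-based parent of child j, matching the ceil(j/2) index in Formula 3."""
    return ceil(j / 2)


def final_score(y_cls: float, y_hidden: float) -> float:
    """Formula 6, interpreted (assumption) as the mean of the two heads."""
    return (y_cls + y_hidden) / 2.0
```

For example, `split_levels(8, 2)` gives one range at level 0, two at level 1, and four at level 2, and `parent_index` maps children 3 and 4 at one level to node 2 at the level above.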

Temporal Divide-and-Conquer Anomaly Actions Localization in Semi-Supervised Videos with Hierarchical Transformer
