Multi-Task Learning based Video Anomaly Detection with Attention

Mohammad Baradaran, Robert Bergevin
DOI: https://doi.org/10.48550/arXiv.2210.07697
2023-05-11
Abstract: Multi-task learning based video anomaly detection methods combine multiple proxy tasks in different branches to detect video anomalies in different situations. Most existing methods either do not combine complementary tasks to effectively cover all motion patterns, or the class of the objects is not explicitly considered. To address the aforementioned shortcomings, we propose a novel multi-task learning based method that combines complementary proxy tasks to better consider the motion and appearance features. We combine the semantic segmentation and future frame prediction tasks in a single branch to learn the object class and consistent motion patterns, and to detect respective anomalies simultaneously. In the second branch, we added several attention mechanisms to detect motion anomalies with attention to object parts, the direction of motion, and the distance of the objects from the camera. Our qualitative results show that the proposed method considers the object class effectively and learns motion with attention to the aforementioned important factors which results in a precise motion modeling and a better motion anomaly detection. Additionally, quantitative results show the superiority of our method compared with state-of-the-art methods.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several shortcomings of existing multi-task learning methods for video anomaly detection:

1. **Proxy tasks are not complementary enough and lack interpretability**: In existing methods, the combined proxy tasks often do not complement one another, and their combination is difficult to interpret.
2. **Object categories are not effectively considered**: Most methods do not explicitly account for the object class when detecting anomalies.
3. **Not all motion anomalies are covered**: Existing methods do not cover the full range of motion anomalies.
4. **Context information is underused**: Context such as object parts, motion direction, and distance from the camera is not fully exploited during anomaly detection.

To solve these problems, the authors propose a new video anomaly detection method based on multi-task learning. It combines three complementary proxy tasks to consider appearance and motion features more comprehensively, thereby improving anomaly detection accuracy. The main contributions are:

- **A new multi-task learning framework**: The framework combines three proxy tasks: future frame prediction, semantic segmentation, and optical flow magnitude prediction.
- **A future semantic segmentation prediction task**: Semantic segmentation and future frame prediction are merged into a single task, future semantic segmentation prediction, which detects appearance and motion anomalies simultaneously.
- **A new attention mechanism**: Spatial and channel attention networks, together with a new attention network, let the model estimate the motion magnitude of objects more accurately while attending to object parts, motion direction, and distance.

### Formula Representation

The following are some of the formulas involved in the paper:

1. **Calculating the direction and magnitude of optical flow**:
   \[ \text{Mag}, \text{Ang} = \text{OF}(I_{t-1}, I_t) \]
   Here, \(\text{Mag}\) is the magnitude of the optical flow between consecutive frames, and \(\text{Ang}\) is the angle of motion relative to the horizontal axis.

2. **Calculating motion-direction features**:
   \[ X = |\cos(\text{Ang})| \]
   \[ Y = |\sin(\text{Ang})| \]

3. **Calculating the anomaly score**:
   \[ S(t) = \sum \left| \text{Out}_{\text{student}}(I_t) - \text{Out}_{\text{teacher}}(I_t) \right| \]
   Here, \(\text{Out}_{\text{student}}(I_t)\) and \(\text{Out}_{\text{teacher}}(I_t)\) are the outputs of the student network and the teacher network respectively, and the summation runs over all pixels in the anomaly map.

4. **Applying the Savitzky-Golay filter for temporal denoising**:
   \[ S_r(t) = \frac{1}{N} \sum_{i=-w}^{w} \alpha_i \, S(t + i) \]
   Here, \(S_r(t)\) is the denoised anomaly score, \(N\) is the normalization factor, \(\alpha_i\) are the convolution coefficients, and \(w\) sets the window size (the sum spans \(2w + 1\) frames).

With these improvements, the method models motion more precisely and detects video anomalies more accurately, as reported by the authors.
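
The direction features, the anomaly score, and the temporal smoothing formulas above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, network outputs are represented as nested lists of per-pixel values, the smoothing uses the standard 5-point quadratic Savitzky-Golay coefficients \((-3, 12, 17, 12, -3)\) with \(N = 35\) (the window size and polynomial order used in the paper are not stated here), and boundary frames are handled by repeating the first and last scores.

```python
import math

# Assumption: standard 5-point quadratic Savitzky-Golay smoothing coefficients
# (alpha_i) and their normalization factor N. The paper's settings may differ.
SG_COEFFS = (-3, 12, 17, 12, -3)
SG_NORM = 35


def direction_features(angle_rad):
    """Return (X, Y) = (|cos(Ang)|, |sin(Ang)|) for one flow angle in radians."""
    return abs(math.cos(angle_rad)), abs(math.sin(angle_rad))


def anomaly_score(student_out, teacher_out):
    """Sum |student - teacher| over all pixels of one frame (2D nested lists)."""
    return sum(
        abs(s - t)
        for s_row, t_row in zip(student_out, teacher_out)
        for s, t in zip(s_row, t_row)
    )


def smooth_scores(scores, coeffs=SG_COEFFS, norm=SG_NORM):
    """Compute S_r(t) = (1/N) * sum_{i=-w..w} alpha_i * S(t + i) for every t."""
    w = len(coeffs) // 2
    last = len(scores) - 1
    smoothed = []
    for t in range(len(scores)):
        total = 0.0
        for i in range(-w, w + 1):
            # Assumption: clamp out-of-range indices to the nearest valid frame.
            idx = min(max(t + i, 0), last)
            total += coeffs[i + w] * scores[idx]
        smoothed.append(total / norm)
    return smoothed
```

Because the coefficients sum to \(N\), a constant score sequence passes through `smooth_scores` unchanged, while an isolated single-frame spike is attenuated. In practice, a library routine such as SciPy's `savgol_filter` would typically replace the hand-written loop.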