Abstract: Weakly supervised video anomaly detection (WS-VAD) is a crucial area in computer vision for developing intelligent surveillance systems. The proposed system uses three feature streams: RGB video, optical flow, and audio signals. Each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the first stream, we employ an attention-based, multi-stage feature enhancement approach to improve the spatial and temporal features from the RGB video. The first stage consists of a ViT-based CLIP module, whose top-k features are concatenated in parallel with rich spatiotemporal features based on I3D and Temporal Contextual Aggregation (TCA). The second stage captures temporal dependencies using the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously, and the third stage selects the most relevant spatiotemporal features. The second stream extracts attention-enhanced spatiotemporal features from the optical flow modality by combining a deep learning model with an attention module. The audio stream captures auditory cues using an attention module integrated with the VGGish model, aiming to detect anomalies based on sound patterns. These two streams enrich the model by incorporating motion and audio signals that often indicate abnormal events undetectable through visual analysis alone. Concatenating the streams in a multimodal fusion step leverages the strengths of each modality, resulting in a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. Extensive experiments on the three benchmark datasets demonstrate the effectiveness of the proposed system over existing state-of-the-art systems.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issue of weakly supervised methods in Video Anomaly Event Detection (VAED). Specifically, the paper investigates the following points:

1. **Multimodal Feature Fusion**:
   - Existing unimodal methods find it difficult to capture the exact context of anomaly events from a single data record due to a lack of sufficient key information, leading to inaccurate anomaly evaluation.
   - Although bimodal methods combine RGB video and audio data, their performance is still unsatisfactory.
2. **Long-Range Context Capture**:
   - Methods that solely use frame-level or short-segment processing of videos limit the effective capture of long-range contextual information.
3. **Feature Effectiveness Enhancement**:
   - A multimodal attention-enhanced feature fusion system is proposed, utilizing three feature streams: RGB video, optical flow, and audio signals to extract complementary spatial and temporal features.
   - Each feature stream is enhanced through an Enhanced Attention Module to improve detection accuracy and robustness.

### Main Contributions

1. **RGB Video Stream**:
   - Utilizes a ViT-based CLIP module to select top-k features, capturing complex visual semantics and contextual information.
   - Combines a CNN-based I3D module with a Temporal Contextual Aggregation (TCA) mechanism to extract rich spatiotemporal features.
   - Further processes features through an Uncertainty-Regulated Dual Memory Units (UR-DMU) model and reduces feature dimensions via a Multi-Layer Perceptron (MLP).
2. **Optical Flow Stream**:
   - Computes motion optical flow from RGB video frames and inputs it into the I3D module to capture spatial and temporal information, emphasizing scene dynamics crucial for detecting anomalies related to abnormal motion.
   - Uses a Transformer to capture long-range dependencies and temporal patterns, generating the final optical flow feature stream.
3. **Audio Stream**:
   - Processes VGGish-extracted audio features using a Transformer to capture audio cues critical for anomaly detection.
   - The VGGish model converts audio input into detailed feature representations, which are further enhanced by the Transformer to capture temporal dependencies and contextual relationships, accurately identifying subtle audio anomalies.
4. **Gated Feature Fusion and Classification**:
   - Concatenates features from all three streams through a gated feature fusion mechanism with an attention module, generating a comprehensive final feature set for the classification module.
   - The classifier predicts segment-level anomaly scores, aggregating these scores into bag-level predictions during training to identify high activations in anomalous situations.
5. **Comprehensive Evaluation**:
   - Extensive experiments on the XD-Violence dataset and two other benchmark datasets demonstrate that this method outperforms existing state-of-the-art methods in anomaly detection performance, achieving significant improvements.

In summary, this paper proposes a novel multimodal attention-enhanced feature fusion system aimed at improving the accuracy and robustness of video anomaly detection by integrating RGB video, optical flow, and audio signals.
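To make the fusion and scoring step described above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes per-segment features from the three streams are already extracted. The function names (`gated_fusion`, `segment_scores`, `bag_score`), the element-wise sigmoid gate, the linear classifier, the feature dimensions, and the top-k mean used for bag-level aggregation are all illustrative assumptions. The summary only states that the streams are concatenated through a gated fusion mechanism and that segment scores are aggregated into bag-level predictions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(rgb, flow, audio, w_gate, b_gate):
    """Concatenate per-segment stream features and reweight them with a gate.

    rgb, flow, audio: arrays of shape (T, d_rgb), (T, d_flow), (T, d_audio).
    w_gate: (D, D) and b_gate: (D,), where D = d_rgb + d_flow + d_audio.
    The sigmoid gate is an assumed form, not taken from the paper.
    """
    fused = np.concatenate([rgb, flow, audio], axis=1)  # (T, D)
    gate = sigmoid(fused @ w_gate + b_gate)             # (T, D), values in (0, 1)
    return gate * fused


def segment_scores(features, w_cls, b_cls):
    """Hypothetical linear classifier giving one anomaly score per segment."""
    return sigmoid(features @ w_cls + b_cls)            # (T,)


def bag_score(scores, k):
    """Video-level (bag) score as the mean of the k highest segment scores.

    Top-k mean is a common multiple-instance choice, assumed here.
    """
    k = max(1, min(k, scores.shape[0]))
    return float(np.sort(scores)[-k:].mean())


# Toy example with random features for T = 32 segments (dimensions are made up).
rng = np.random.default_rng(0)
T = 32
rgb = rng.normal(size=(T, 8))
flow = rng.normal(size=(T, 4))
audio = rng.normal(size=(T, 4))
D = rgb.shape[1] + flow.shape[1] + audio.shape[1]

w_gate = rng.normal(scale=0.1, size=(D, D))
b_gate = np.zeros(D)
w_cls = rng.normal(scale=0.1, size=D)
b_cls = 0.0

fused = gated_fusion(rgb, flow, audio, w_gate, b_gate)
scores = segment_scores(fused, w_cls, b_cls)
video_score = bag_score(scores, k=T // 16 + 1)
```

In a weakly supervised setting, only `video_score` would be compared against the video-level label during training, while `scores` would provide the segment-level predictions at test time.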

Multimodal Attention-Enhanced Feature Fusion-based Weekly Supervised Anomaly Violence Detection

Multimodal and multiscale feature fusion for weakly supervised video anomaly detection

Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Memory-Augmented Spatial-Temporal Consistency Network for Video Anomaly Detection.

Multi-scale Spatial-temporal Interaction Network for Video Anomaly Detection

Anomaly detection in surveillance videos using transformer based attention model

FE-VAD: High-Low Frequency Enhanced Weakly Supervised Video Anomaly Detection

Self-Attention Memory-Augmented Wavelet-CNN for Anomaly Detection

Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning

Real-world Video Anomaly Detection by Extracting Salient Features in Videos

MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos

CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

Learning Attention Augmented Spatial-temporal Normality for Video Anomaly Detection

Enhancing Video Anomaly Detection Using a Transformer Spatiotemporal Attention Unsupervised Framework for Large Datasets

Anomaly Detection Based on a 3D Convolutional Neural Network Combining Convolutional Block Attention Module Using Merged Frames

Dual Memory Units with Uncertainty Regulation for Weakly Supervised Video Anomaly Detection

Attention-based anomaly detection in multi-view surveillance videos

Contrastive Attention for Video Anomaly Detection

Video Anomaly Detection Based on Global–Local Convolutional Autoencoder

Weakly-supervised Joint Anomaly Detection and Classification

Attention-based residual autoencoder for video anomaly detection