Multimodal Attention-Enhanced Feature Fusion-based Weekly Supervised Anomaly Violence Detection

Yuta Kaneko, Abu Saleh Musa Miah, Najmul Hassan, Hyoun-Sup Lee, Si-Woong Jang, Jungpil Shin
2024-09-17
Abstract: Weakly supervised video anomaly detection (WS-VAD) is a crucial area in computer vision for developing intelligent surveillance systems. The proposed system uses three feature streams: RGB video, optical flow, and audio signals, where each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the first stream, we employ an attention-based, multi-stage feature enhancement approach to improve spatial and temporal features from the RGB video. The first stage consists of a ViT-based CLIP module, with top-k features concatenated in parallel with rich spatiotemporal features based on I3D and Temporal Contextual Aggregation (TCA). The second stage captures temporal dependencies using the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously, and the third stage selects the most relevant spatiotemporal features. The second stream extracts enhanced attention-based spatiotemporal features from the optical flow modality by integrating a deep learning module with an attention module. The audio stream captures auditory cues using an attention module integrated with the VGGish model, aiming to detect anomalies based on sound patterns. These streams enrich the model by incorporating motion and audio signals that are often indicative of abnormal events undetectable through visual analysis alone. Concatenating the multimodal features leverages the strengths of each modality, resulting in a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. Extensive experiments on the three benchmark datasets demonstrate the effectiveness of the proposed system over existing state-of-the-art systems.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address the issue of weakly supervised methods in Video Anomaly Event Detection (VAED). Specifically, the paper investigates the following points:

1. **Multimodal Feature Fusion**:
   - Existing unimodal methods find it difficult to capture the exact context of anomaly events from a single data record due to a lack of sufficient key information, leading to inaccurate anomaly evaluation.
   - Although bimodal methods combine RGB video and audio data, their performance is still unsatisfactory.
2. **Long-Range Context Capture**:
   - Methods that solely use frame-level or short-segment processing of videos limit the effective capture of long-range contextual information.
3. **Feature Effectiveness Enhancement**:
   - A multimodal attention-enhanced feature fusion system is proposed, utilizing three feature streams: RGB video, optical flow, and audio signals to extract complementary spatial and temporal features.
   - Each feature stream is enhanced through an Enhanced Attention Module to improve detection accuracy and robustness.

### Main Contributions

1. **RGB Video Stream**:
   - Utilizes a ViT-based CLIP module to select top-k features, capturing complex visual semantics and contextual information.
   - Combines a CNN-based I3D module with a Temporal Contextual Aggregation (TCA) mechanism to extract rich spatiotemporal features.
   - Further processes features through an Uncertainty-Regulated Dual Memory Units (UR-DMU) model and reduces feature dimensions via a Multi-Layer Perceptron (MLP).
2. **Optical Flow Stream**:
   - Computes motion optical flow from RGB video frames and inputs it into the I3D module to capture spatial and temporal information, emphasizing scene dynamics crucial for detecting anomalies related to abnormal motion.
   - Uses a Transformer to capture long-range dependencies and temporal patterns, generating the final optical flow feature stream.
3. **Audio Stream**:
   - Processes VGGish-extracted audio features using a Transformer to capture audio cues critical for anomaly detection.
   - The VGGish model converts audio input into detailed feature representations, which are further enhanced by the Transformer to capture temporal dependencies and contextual relationships, accurately identifying subtle audio anomalies.
4. **Gated Feature Fusion and Classification**:
   - Concatenates features from all three streams through a gated feature fusion mechanism with an attention module, generating a comprehensive final feature set for the classification module.
   - The classifier predicts segment-level anomaly scores, aggregating these scores into bag-level predictions during training to identify high activations in anomalous situations.
5. **Comprehensive Evaluation**:
   - Extensive experiments on the XD-Violence dataset and two other benchmark datasets demonstrate that this method outperforms existing state-of-the-art methods in anomaly detection performance, achieving significant improvements.

In summary, this paper proposes a novel multimodal attention-enhanced feature fusion system aimed at improving the accuracy and robustness of video anomaly detection by integrating RGB video, optical flow, and audio signals.
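
The gated fusion and bag-level aggregation steps described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The summary does not specify the gate's form or the aggregation rule, so two assumptions are made here: the gate is an element-wise sigmoid over the concatenated per-segment features, and the bag-level score is the mean of the top-k segment scores, a common choice in weakly supervised multiple-instance learning. All function names, parameter names, and dimensions are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(rgb, flow, audio, gate_weights, gate_bias):
    """Fuse per-segment features from three streams.

    Assumption: the gate is sigmoid(W @ concat + b), applied element-wise
    to the concatenated vector. rgb, flow, and audio are lists of
    per-segment feature vectors with the same number of segments.
    """
    fused = []
    for r, f, a in zip(rgb, flow, audio):
        concat = r + f + a  # list concatenation of the three streams
        gates = [
            sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
            for row, b in zip(gate_weights, gate_bias)
        ]
        fused.append([g * x for g, x in zip(gates, concat)])
    return fused


def segment_scores(fused, clf_weights, clf_bias):
    """Linear classifier plus sigmoid: one anomaly score per segment."""
    return [
        sigmoid(sum(w * x for w, x in zip(clf_weights, seg)) + clf_bias)
        for seg in fused
    ]


def bag_score(scores, k):
    """Video-level (bag) score as the mean of the top-k segment scores.

    Assumption: top-k mean pooling; the summary only says segment scores
    are aggregated into bag-level predictions during training.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    top = sorted(scores, reverse=True)[: max(1, k)]
    return sum(top) / len(top)


if __name__ == "__main__":
    rng = random.Random(0)
    n_segments, d_rgb, d_flow, d_audio = 6, 4, 3, 2  # toy sizes
    d = d_rgb + d_flow + d_audio

    def rand_feats(dim):
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_segments)]

    # Random placeholders stand in for learned parameters and real features.
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
    b = [0.0] * d
    clf_w = [rng.uniform(-0.5, 0.5) for _ in range(d)]

    fused = gated_fusion(rand_feats(d_rgb), rand_feats(d_flow), rand_feats(d_audio), W, b)
    scores = segment_scores(fused, clf_w, 0.0)
    print("segment scores:", [round(s, 3) for s in scores])
    print("bag score (top-2 mean):", round(bag_score(scores, 2), 3))
```

In training, only the bag-level score would be compared against the video-level label, which is what makes the setup weakly supervised: no segment-level annotations are needed. A real system would replace the random placeholders with learned parameters and with features from the CLIP/I3D, optical flow, and VGGish streams.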