Abstract: This paper proposes a hybrid fusion-based deep learning approach that uses two modalities, audio and video, to improve human activity recognition and violence detection in public places. To take advantage of audiovisual fusion, late fusion, intermediate fusion, and hybrid fusion-based deep learning (HFBDL) are used and compared. Since the objective is to detect and recognize human violence in public places, the Real-life Violence Situation (RLVS) dataset is expanded and used. Simulation results for HFBDL show 96.67% accuracy on validation data, which is higher than other state-of-the-art methods on this dataset. To showcase the model's ability in real-world scenarios, another dataset of 54 videos with sound, covering both violent and non-violent situations, was recorded. The model correctly classified 52 of the 54 videos. The proposed method therefore shows promising performance in real scenarios and can be used for human action recognition and violence detection in public places for security purposes.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to improve the accuracy of human activity recognition and violence detection in public places by combining deep learning methods for audio and video modalities. Specifically, the authors propose a Hybrid Fusion-Based Deep Learning (HFBDL) method based on two different modalities (audio and video) to enhance the effectiveness of human activity recognition and violence detection.

### Background and Motivation

In recent years, the frequent occurrence of violent incidents in public places and densely populated areas has raised significant concerns about public safety. To address this challenge, surveillance cameras have been widely deployed in various locations. However, as the number of cameras grows, more operators and supervisors are needed to monitor video streams in real time, which does not scale well. Developing automated Video Surveillance Systems (VSSs) has therefore become crucial to ensuring public safety.

### Research Methods

1. **Dataset**: The authors extended and used the Real-life Violence Situation (RLVS) dataset, which contains real violent and non-violent scenes. To ensure the diversity and applicability of the dataset, all videos include relevant audio information.
2. **Model Architecture**:
   - **Audio Model**: The pre-trained VGGish model is used to extract audio features.
   - **Video Model**: The pre-trained I3D model is used to extract video features.
   - **Fusion Strategy**: Three fusion strategies (early fusion, intermediate fusion, and late fusion) were studied, and a hybrid fusion method (HFBDL) was proposed that combines the advantages of intermediate and late fusion.
3. **Data Augmentation**: To improve the model's generalization ability and accuracy, data augmentation techniques were employed, including color jittering, rotation, adding noise, horizontal/vertical flipping, Gaussian blur, median blur, and brightness/contrast adjustment.

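The difference between the fusion strategies listed under Model Architecture can be illustrated with a minimal sketch. Late fusion combines the class probabilities that each modality produces on its own, while intermediate fusion concatenates the audio and video embeddings and classifies them jointly. The summary does not say how HFBDL merges the two, so the simple averaging in `hybrid_fusion` is an assumption for illustration and not the authors' implementation. The embedding sizes (128 for audio, 1024 for video), the randomly initialised heads, and all function names are likewise hypothetical; random vectors stand in for real VGGish and I3D features.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features, weights, bias):
    """Dense layer: one output logit per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def late_fusion(audio_probs, video_probs, audio_weight=0.5):
    """Decision-level fusion: weighted average of per-modality probabilities."""
    return [audio_weight * a + (1.0 - audio_weight) * v
            for a, v in zip(audio_probs, video_probs)]


def intermediate_fusion(audio_feat, video_feat, weights, bias):
    """Feature-level fusion: concatenate embeddings, then classify jointly."""
    joint = list(audio_feat) + list(video_feat)
    return softmax(linear(joint, weights, bias))


def hybrid_fusion(audio_feat, video_feat, heads):
    """Assumed hybrid rule: average the late-fused and joint predictions."""
    audio_probs = softmax(linear(audio_feat, *heads["audio"]))
    video_probs = softmax(linear(video_feat, *heads["video"]))
    late = late_fusion(audio_probs, video_probs)
    joint = intermediate_fusion(audio_feat, video_feat, *heads["joint"])
    return [(l + j) / 2.0 for l, j in zip(late, joint)]


def random_head(rng, in_dim, n_classes=2):
    """Untrained classifier head (weights, bias), used only as a placeholder."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(n_classes)]
    return weights, [0.0] * n_classes


rng = random.Random(0)
AUDIO_DIM, VIDEO_DIM = 128, 1024  # assumed embedding sizes

# Placeholder embeddings standing in for VGGish (audio) and I3D (video) output.
audio_feat = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
video_feat = [rng.gauss(0, 1) for _ in range(VIDEO_DIM)]

heads = {
    "audio": random_head(rng, AUDIO_DIM),
    "video": random_head(rng, VIDEO_DIM),
    "joint": random_head(rng, AUDIO_DIM + VIDEO_DIM),
}

# Index 0 = non-violent, index 1 = violent (an arbitrary labelling choice).
probs = hybrid_fusion(audio_feat, video_feat, heads)
label = "violent" if probs[1] > probs[0] else "non-violent"
print(probs, label)
```

Because the heads are untrained, the printed label carries no meaning; the sketch only shows where each strategy merges the two modalities. In a trained system the three heads would be learned, and the equal weighting between branches would be one of several possible design choices.
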
### Experimental Results

- **Accuracy on Validation Set**: The HFBDL method achieved an accuracy of 96.67% on the validation set, outperforming other state-of-the-art methods on this dataset.
- **Real-World Testing**: On a separate dataset of 54 violent and non-violent videos with sound, the model correctly classified 52, demonstrating its effectiveness in practical applications.

### Application Prospects

This method has broad application prospects in safety monitoring of public places and can be used for real-time violence detection to enhance public safety. For example, interactive robots can be deployed in public places such as airports to continuously monitor video and audio of the surrounding environment and promptly alert authorized personnel.

### Conclusion

This paper proposes a hybrid fusion strategy that combines deep learning methods for audio and video modalities, improving the accuracy of human activity recognition and violence detection in public places. The result is relevant to developing more efficient and reliable surveillance systems.

Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion
