Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion

Pooya Janani, Amirabolfazl Suratgar, Afshin Taghvaeipour
2024-08-04
Abstract: This paper proposes a hybrid fusion-based deep learning approach that uses two modalities, audio and video, to improve human activity recognition and violence detection in public places. To take advantage of audiovisual fusion, late fusion, intermediate fusion, and hybrid fusion-based deep learning (HFBDL) are used and compared. Since the objective is to detect and recognize human violence in public places, the Real-life violence situation (RLVS) dataset is expanded and used. Simulation results for HFBDL show 96.67% accuracy on validation data, which is higher than that of other state-of-the-art methods on this dataset. To showcase the model's ability in real-world scenarios, another dataset of 54 videos with sound, covering both violent and non-violent situations, was recorded. The model correctly classified 52 of the 54 videos. The proposed method therefore shows promising performance in real scenarios and can be used for human action recognition and violence detection in public places for security purposes.
Computer Vision and Pattern Recognition, Machine Learning, Multimedia, Image and Video Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to improve the accuracy of human activity recognition and violence detection in public places by fusing deep learning models for two modalities, audio and video. Specifically, the authors propose a Hybrid Fusion-Based Deep Learning (HFBDL) method that draws on both modalities.

### Background and Motivation

In recent years, frequent violent incidents in public places and densely populated areas have raised significant concerns about public safety. In response, surveillance cameras have been widely deployed. However, as the number of cameras grows, more operators and supervisors are needed to monitor the video streams in real time, which does not scale. Developing automated Video Surveillance Systems (VSSs) has therefore become crucial for public safety.

### Research Methods

1. **Dataset**: The authors extended and used the Real-life Violence Situation (RLVS) dataset, which contains real violent and non-violent scenes. To ensure the diversity and applicability of the dataset, all videos include the corresponding audio.
2. **Model Architecture**:
   - **Audio Model**: The pre-trained VGGish model is used to extract audio features.
   - **Video Model**: The pre-trained I3D model is used to extract video features.
   - **Fusion Strategy**: Late fusion and intermediate fusion were compared with a proposed hybrid fusion method (HFBDL), which combines the advantages of intermediate and late fusion.
3. **Data Augmentation**: To improve the model's generalization and accuracy, data augmentation techniques were employed, including color jittering, rotation, added noise, horizontal and vertical flipping, Gaussian blur, median blur, and brightness/contrast adjustment.
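
The fusion strategies named under Model Architecture can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary does not say how HFBDL combines its branches, so the blending rule in `hybrid_fusion`, the embedding sizes, the 0.5 threshold, and every function and parameter name here are assumptions made for illustration. Random arrays stand in for the VGGish and I3D features and for trained classifier weights.

```python
import numpy as np

# Assumed embedding sizes, for illustration only; the summary does not state them.
AUDIO_DIM = 128    # hypothetical audio embedding size
VIDEO_DIM = 1024   # hypothetical video embedding size


def sigmoid(x):
    """Map logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def late_fusion(p_audio, p_video, w_audio=0.5):
    """Decision-level fusion: weighted average of per-modality probabilities."""
    return w_audio * p_audio + (1.0 - w_audio) * p_video


def intermediate_fusion(f_audio, f_video, joint_w, joint_b):
    """Feature-level fusion: concatenate embeddings, then one joint classifier."""
    joint = np.concatenate([f_audio, f_video], axis=-1)
    return sigmoid(joint @ joint_w + joint_b)


def hybrid_fusion(f_audio, f_video, params, w_joint=0.5):
    """Assumed hybrid rule: blend the joint prediction with the late-fused one."""
    p_audio = sigmoid(f_audio @ params["audio_w"] + params["audio_b"])
    p_video = sigmoid(f_video @ params["video_w"] + params["video_b"])
    p_late = late_fusion(p_audio, p_video)
    p_joint = intermediate_fusion(
        f_audio, f_video, params["joint_w"], params["joint_b"]
    )
    return w_joint * p_joint + (1.0 - w_joint) * p_late


# Random stand-ins for extracted features and trained weights.
rng = np.random.default_rng(0)
batch = 4
f_audio = rng.normal(size=(batch, AUDIO_DIM))
f_video = rng.normal(size=(batch, VIDEO_DIM))
params = {
    "audio_w": rng.normal(scale=0.01, size=AUDIO_DIM),
    "audio_b": 0.0,
    "video_w": rng.normal(scale=0.01, size=VIDEO_DIM),
    "video_b": 0.0,
    "joint_w": rng.normal(scale=0.01, size=AUDIO_DIM + VIDEO_DIM),
    "joint_b": 0.0,
}

p_violent = hybrid_fusion(f_audio, f_video, params)
is_violent = p_violent >= 0.5  # hypothetical decision threshold
print(p_violent.shape, is_violent.dtype)
```

In this sketch, late fusion only sees each modality's final probability, while intermediate fusion lets one classifier weigh audio and video features jointly. The hybrid variant keeps both paths, which is one plausible reading of "combining the advantages of intermediate and late fusion".
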
### Experimental Results

- **Accuracy on Validation Set**: The HFBDL method achieved 96.67% accuracy on the validation set, outperforming other state-of-the-art methods on this dataset.
- **Real-World Testing**: On a separately recorded dataset of 54 violent and non-violent videos with sound, the model classified 52 videos correctly (about 96.3%), indicating that it carries over to practical conditions.

### Application Prospects

The method is suited to safety monitoring in public places, where it can be used for real-time violence detection. For example, interactive robots deployed in public places such as airports could continuously monitor the video and audio of their surroundings and promptly alert authorized personnel.

### Conclusion

This paper proposes a hybrid fusion strategy that combines deep learning models for the audio and video modalities, reaching 96.67% validation accuracy for human activity recognition and violence detection on the expanded RLVS dataset. The result supports the development of more efficient and reliable surveillance systems.