JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos

Pietro Nardelli, Danilo Comminiello
2024-08-04
Abstract: The increasing proliferation of video surveillance cameras and the escalating demand for crime prevention have intensified interest in the task of violence detection within the research community. Compared to other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the wide variety of real fight scenes. Unfortunately, existing datasets for violence detection are relatively small in comparison to those for other action recognition tasks. Moreover, surveillance footage often features different individuals in each video and varying backgrounds for each camera. In addition, fast detection of violent actions in real-life surveillance videos is crucial to prevent adverse outcomes, thus necessitating models that are optimized for reduced memory usage and computational costs. These challenges complicate the application of traditional action recognition methods. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model processes two spatiotemporal video streams, namely RGB frames and optical flows, and incorporates a new regularized self-supervised learning approach for videos. JOSENet demonstrates improved performance compared to state-of-the-art methods, while utilizing only one-fourth of the frames per video segment and operating at a reduced frame rate. The source code is available at https://github.com/ispamm/JOSENet.
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
The paper addresses the problem of detecting violent behavior in surveillance videos. Specifically, it focuses on the following key challenges:

1. **Small dataset size**: Existing violence detection datasets are relatively small compared to datasets for other action recognition tasks, which limits the effectiveness of model training.
2. **High variation in background and actors**: The people and the backgrounds differ from video to video and from camera to camera, which makes it harder for a model to generalize.
3. **Strict real-time requirements**: Rapid detection of violent behavior is crucial for preventing adverse outcomes, so the model must be optimized for low memory usage and low computational cost.
4. **Scarcity of labeled data**: Annotated surveillance footage is very limited, making traditional supervised learning methods difficult to apply.

To address these issues, the authors propose JOSENet, a new self-supervised framework aimed at improving the performance of violence detection in surveillance videos. JOSENet processes two spatiotemporal video streams (RGB frames and optical flow) and employs a novel regularized self-supervised learning approach. Experimental results show that JOSENet outperforms existing self-supervised solutions while using fewer frames per video segment and a lower frame rate.
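To make the idea of using fewer frames per segment at a lower frame rate concrete, the following is a minimal sketch of how a segment's frame indices might be chosen. It is an illustration only, not the authors' pipeline: the function name `sample_segment_indices` and the example values (a 30 fps source, a 7.5 fps target, 16 frames per segment) are assumptions made here and do not come from the paper.

```python
def sample_segment_indices(start, source_fps, target_fps, frames_per_segment):
    """Pick source-frame indices for one segment sampled at a lower frame rate.

    Hypothetical helper for illustration; all parameter values used below
    are assumptions, not settings reported by the paper.
    """
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps > source_fps:
        raise ValueError("target_fps cannot exceed source_fps")
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")

    # Number of source frames to advance for each sampled frame.
    step = source_fps / target_fps

    # Truncate to an integer index so every sample maps to a real source frame.
    return [start + int(i * step) for i in range(frames_per_segment)]


if __name__ == "__main__":
    # Assumed example: 30 fps camera, sampled at 7.5 fps, 16 frames per segment.
    indices = sample_segment_indices(0, 30, 7.5, 16)
    print(indices[:4])  # [0, 4, 8, 12]
```

In this sketch, lowering `target_fps` widens the time window that a fixed number of frames covers, while lowering `frames_per_segment` reduces the memory and compute needed per segment. Both knobs reflect the efficiency trade-off the summary describes, though the actual values used by JOSENet should be taken from the paper or its repository.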