Abstract: This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that detects dangers in a video stream in real time. To achieve this, a novel approach combining temporal and spatial analysis is proposed. Several avenues were explored to improve anomaly detection by integrating object detection, human pose detection, and motion analysis. For interpretability of results, techniques commonly used in image analysis, such as activation and saliency maps, were extended to videos, and an original method was proposed. The proposed architecture performs binary or multiclass classification, depending on whether it only needs to raise an alert or must also identify the cause. Numerous neural network models were tested, and three were selected: You Only Look Once (YOLO) for spatial analysis, a Convolutional Recurrent Neural Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. The models were trained with supervised learning, for which two proprietary datasets were created. The first dataset focuses on objects that may play a role in anomalies, while the second consists of videos that do or do not contain anomalies. This approach can process both continuous video streams and finite videos, providing greater flexibility in detection.
What problem does this paper attempt to address?
The problem this paper attempts to solve is **real-time anomaly detection in video streams, in order to improve security and response speed**. Specifically, the paper aims to develop an artificial intelligence system that quickly identifies potentially dangerous events in videos and triggers alarms. To this end, the author proposes a new method that combines spatial and temporal analysis.
### Specific description of the problem
1. **Definition and classification of anomaly detection**:
- Anomaly detection refers to identifying elements, events, or observations that do not conform to the expected model. Depending on the application scenario, it covers intrusion detection, fraud detection, medical anomaly detection, industrial damage detection, text anomaly detection, and visual anomaly detection, among others.
- According to their frequency and impact, anomalies can be divided into point anomalies, contextual anomalies, and collective anomalies. This paper mainly focuses on contextual anomalies, that is, behaviors that are abnormal only in a specific situation.
2. **Research objectives**:
- The goal of the paper is to develop an artificial intelligence system that detects abnormal events in video streams in real time, especially events that directly affect the safety of individuals or groups.
- Specifically, the author aims to improve the accuracy and response speed of anomaly detection by combining spatial analysis (such as object detection and human pose detection) with temporal analysis (such as action recognition).
3. **Requirements for video analysis**:
- Video analysis can be carried out in multiple ways, including analyzing only audio, analyzing each frame independently, analyzing frame sequences, and analyzing audio and images simultaneously. Since most surveillance videos do not contain audio, this paper mainly focuses on the analysis of frame sequences.
- Temporal analysis is used to detect the time span of events, while spatial analysis is used to identify the specific location where events occur. Combining these two analysis methods can lead to a more comprehensive understanding of abnormal behaviors in videos.
4. **Requirements for data sets**:
- In order to train and evaluate anomaly detection models, data sets that contain a large number of normal and abnormal scenarios are required. Although existing public data sets cover some anomaly types, they often have problems such as insufficient data volume and class imbalance.
- Therefore, the author created two proprietary data sets: one contains images of objects that may play a role in anomalies, and the other contains videos labeled as anomalous or non-anomalous. These data sets provide richer and more diverse samples for model training.
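The frame-sequence analysis described in item 3 needs some way to cut a video into fixed-length clips, whether the input is a finite file or an endless stream. The source does not say how this is done; the sketch below is one common approach (a sliding window over a frame iterator), and the function name `sliding_windows` and its `size` and `stride` parameters are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sliding_windows(frames: Iterable[T], size: int, stride: int = 1) -> Iterator[List[T]]:
    """Yield overlapping clips of `size` consecutive frames, advancing by `stride` frames.

    Works on any iterable, so the same code handles a finite video (a list)
    and a continuous stream (a generator that never ends).
    """
    if size < 1 or stride < 1:
        raise ValueError("size and stride must be positive")
    buffer: Deque[T] = deque(maxlen=size)  # oldest frame is dropped automatically
    for index, frame in enumerate(frames):
        buffer.append(frame)
        # The first full window ends at index size - 1; later ones follow every `stride` frames.
        if index >= size - 1 and (index - (size - 1)) % stride == 0:
            yield list(buffer)


# Integers stand in for frames here; a real pipeline would pass decoded images.
clips = list(sliding_windows(range(6), size=3, stride=2))
print(clips)  # [[0, 1, 2], [2, 3, 4]]
```

Because the buffer holds at most `size` frames, memory use stays constant on a continuous stream; a larger `stride` trades detection latency for fewer clips to classify.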
### Solutions
To achieve the above-mentioned goals, the author proposes the following solutions:
- **Combining spatial and temporal analysis**: Spatial analysis (YOLO for object detection) is combined with temporal analysis (a CRNN composed of VGG19 and a GRU for action recognition) to improve the accuracy of anomaly detection.
- **Multi-modal architecture**: The architecture can operate in parallel or serial mode to adapt to different application scenarios. The parallel mode is faster, but the serial mode is usually more reliable.
- **Supervised learning**: The supervised learning method is selected, and two proprietary data sets are created for model training.
- **Interpretability techniques**: Techniques such as activation maps and saliency maps are introduced to enhance the interpretability of model results.
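The difference between the parallel and serial modes can be illustrated with a toy pipeline. The source does not specify how the two modes are wired, so this sketch assumes one plausible reading: in parallel mode both analyses read the raw clip and their outputs are merged, while in serial mode the spatial stage's output becomes the temporal stage's input. Every function below is a hypothetical stand-in written in plain Python, not YOLO, a CRNN, or an MLP.

```python
from typing import List, Sequence

Frame = List[float]      # hypothetical stand-in for an image: a flat list of pixel scores
Clip = Sequence[Frame]   # a clip is an ordered sequence of frames


def object_score(clip: Clip) -> float:
    """Stand-in for spatial analysis (YOLO in the source): strongest response in any frame."""
    return max(max(frame) for frame in clip)


def highlight_objects(clip: Clip, min_score: float = 0.5) -> List[Frame]:
    """Stand-in for spatial filtering: zero out pixels whose score is below min_score."""
    return [[v if v >= min_score else 0.0 for v in frame] for frame in clip]


def motion_score(clip: Clip) -> float:
    """Stand-in for temporal analysis (the CRNN in the source): mean absolute frame-to-frame change."""
    changes = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(clip, clip[1:])
    ]
    return sum(changes) / len(changes) if changes else 0.0


def classify(features: List[float], threshold: float = 0.5) -> bool:
    """Stand-in for the classifier (an MLP in the source): alert when the mean feature exceeds threshold."""
    return sum(features) / len(features) > threshold


def run_parallel(clip: Clip) -> bool:
    # Both branches read the raw clip independently; their outputs are merged for the classifier.
    return classify([object_score(clip), motion_score(clip)])


def run_serial(clip: Clip) -> bool:
    # The spatial stage runs first, and its output becomes the temporal stage's input.
    return classify([motion_score(highlight_objects(clip))])


static_clip = [[0.1, 0.1], [0.1, 0.1]]   # nothing salient, nothing moves
moving_clip = [[0.0, 0.0], [1.0, 1.0]]   # a strong response appears between frames
print(run_parallel(static_clip), run_serial(static_clip))  # False False
print(run_parallel(moving_clip), run_serial(moving_clip))  # True True
```

Under this reading, the speed advantage of parallel mode comes from the two branches having no data dependency, so they can run concurrently, whereas serial mode must wait for the spatial stage before the temporal stage can start.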
In summary, the main contribution of this paper is a new anomaly detection method that combines spatial and temporal analysis and uses several neural network models together to detect abnormal events in video streams efficiently.