Spatiotemporal Object Detection for Improved Aerial Vehicle Detection in Traffic Monitoring

Kristina Telegraph,Christos Kyrkou
DOI: https://doi.org/10.1109/TAI.2024.3454566
2024-10-17
Abstract:This work presents advancements in multi-class vehicle detection using UAV cameras through the development of spatiotemporal object detection models. The study introduces a Spatio-Temporal Vehicle Detection Dataset (STVD) containing 6, 600 annotated sequential frame images captured by UAVs, enabling comprehensive training and evaluation of algorithms for holistic spatiotemporal perception. A YOLO-based object detection algorithm is enhanced to incorporate temporal dynamics, resulting in improved performance over single frame models. The integration of attention mechanisms into spatiotemporal models is shown to further enhance performance. Experimental validation demonstrates significant progress, with the best spatiotemporal model exhibiting a 16.22% improvement over single frame models, while it is demonstrated that attention mechanisms hold the potential for additional performance gains.
Computer Vision and Pattern Recognition,Artificial Intelligence
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: how to improve the multi - class vehicle detection performance of unmanned aerial vehicle (UAV) cameras in traffic monitoring by developing spatio - temporal object detection models. Specifically, the paper aims to overcome the challenges encountered by existing single - frame models when processing video data, such as occlusion, motion blur and illumination change problems, and use spatio - temporal information to improve detection accuracy and consistency. ### Problem Background Traditional object detection methods mainly focus on the processing of a single image. Although these methods have made significant progress in static image detection, when processing video data, due to the inability to effectively use the information in the time dimension, they perform poorly in dynamic scenes. For example, single - frame models are prone to occlusion problems in dense areas, it is difficult to maintain the consistency of detection results between frames, and can only access the limited features of a single frame. ### Solution To solve these problems, the paper proposes the following innovations: 1. **Constructing a Spatio - Temporal Object Detection Dataset (STVD)**: - Created a dataset containing 6,600 annotated continuous - frame images, covering three different types of vehicles (cars, trucks, buses), and obtained from aerial - view videos shot by UAVs on different road sections in the Republic of Cyprus. 2. **Improving the YOLOv5 Framework to Process Spatio - Temporal Data**: - Extended the YOLOv5 object detection framework through architecture enhancement and input representation changes, enabling it to process spatio - temporal data, thereby better capturing time dynamics. 3. **Introducing an Attention Mechanism**: - Integrated the attention mechanism into the spatio - temporal model, further improving the model performance. Experimental results show that the best spatio - temporal model improves the performance by 16.22% compared to the single - frame model, and the application of the attention mechanism also has the potential to bring more performance improvements. ### Experimental Verification Through a series of experiments, the author conducted quantitative and qualitative analyses of different models and examined the specific performance of each type of vehicle. The experimental results show that the spatio - temporal model has made significant progress in many aspects, especially in handling complex traffic scenes. ### Impact and Application This research not only improves the real - time detection ability of UAVs in traffic monitoring but also provides technical support for future intelligent transportation systems. Through more accurate traffic flow analysis, congestion point identification and accident detection, it can help optimize signal light control, replan traffic flow, and prevent secondary accidents. In the long run, these data can also provide valuable references for urban planners, helping them make wiser decisions on road expansion and infrastructure construction. In conclusion, by introducing spatio - temporal information and an attention mechanism, this paper solves the limitations of traditional single - frame models in video processing and provides new ideas and technical means for more efficient traffic monitoring.