A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS

Juan Terven, Diana Cordova-Esparza
DOI: https://doi.org/10.3390/make5040083
2024-02-05
Abstract: YOLO has become a central real-time object detection system for robotics, driverless cars, and video monitoring applications. We present a comprehensive analysis of YOLO's evolution, examining the innovations and contributions in each iteration from the original YOLO up to YOLOv8, YOLO-NAS, and YOLO with Transformers. We start by describing the standard metrics and postprocessing; then, we discuss the major changes in network architecture and training tricks for each model. Finally, we summarize the essential lessons from YOLO's development and provide a perspective on its future, highlighting potential research directions to enhance real-time object detection systems.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper comprehensively reviews the development of the YOLO (You Only Look Once) architecture in computer vision, from the original YOLOv1 to YOLOv8, YOLO-NAS, and YOLO models that incorporate Transformers. It focuses on the following aspects:

1. **Requirements for real-time object detection**: Real-time object detection has become a key component in self-driving cars, robotics, video surveillance, and augmented reality. The YOLO series stands out for its balance between speed and accuracy, although performance differs across versions.
2. **Evolution of the YOLO architecture**: Since the release of YOLOv1, each subsequent version has been refined to overcome the limitations of its predecessors and improve performance. The paper analyzes the main changes in each iteration, including innovations in network structure and training techniques.
3. **Evaluation metrics and post-processing methods**: To help readers understand the performance of the YOLO models, the paper introduces common evaluation metrics such as AP (Average Precision) and how it is calculated, and discusses post-processing techniques such as Non-Maximum Suppression (NMS).
4. **Future development directions**: Building on its survey of existing YOLO architectures, the paper explores possible research directions for further improving real-time object detection systems.

Together, these sections summarize the development of the YOLO framework, help readers choose the YOLO model best suited to a given application, and point out potential research paths.
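The Non-Maximum Suppression step mentioned under evaluation and post-processing can be illustrated with a minimal greedy sketch in plain Python. The function names `iou` and `nms`, the `(x1, y1, x2, y2)` corner box format, and the default threshold of 0.5 are assumptions made for this illustration and are not taken from the paper; production detectors typically use a vectorized implementation.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format (assumed format)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    A box is kept only if its IoU with every already-kept box is at or
    below iou_threshold (0.5 is an illustrative default).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the lower-scored overlapping box is dropped
```

The first two boxes have an IoU of 81/119 (about 0.68), which exceeds the threshold, so only the higher-scored one survives.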
### Formula presentation

- **Average Precision (AP)**:
  \[ AP=\sum_{r}\text{Precision}(r)\times\Delta\text{Recall}(r) \]
  where $\text{Precision}(r)$ is the precision at recall level $r$, and $\Delta\text{Recall}(r)$ is the change in recall at that level. Recall is already normalized by the total number of relevant (ground-truth) instances, so the sum needs no further division.
- **Intersection over Union (IoU)**:
  \[ IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}=\frac{|A\cap B|}{|A\cup B|} \]
  where $A$ and $B$ are the predicted box and the ground-truth box, respectively.

These formulas measure the performance of object detection models and make it possible to compare how changes in network structure and training methods affect detection accuracy across YOLO versions.
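As a rough illustration of the two metrics, the sketch below computes IoU for axis-aligned boxes and AP as the sum of precision times the change in recall over score-ranked detections, where recall is the fraction of ground-truth instances found so far. The function names, the `(x1, y1, x2, y2)` box format, and the inputs `is_true_positive` and `num_ground_truth` are assumptions made for this example. It applies no precision interpolation, so its values can differ from those of benchmark evaluation tools that do.

```python
def iou(a, b):
    """|A ∩ B| / |A ∪ B| for boxes in (x1, y1, x2, y2) format (assumed)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(scores, is_true_positive, num_ground_truth):
    """AP = sum over ranked detections of Precision * delta Recall.

    scores[i] is the confidence of detection i, is_true_positive[i] says
    whether it matched a ground-truth box (for example via an IoU
    threshold), and num_ground_truth is the number of ground-truth boxes.
    """
    if num_ground_truth <= 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        if is_true_positive[i]:
            tp += 1
        precision = tp / rank
        recall = tp / num_ground_truth
        ap += precision * (recall - prev_recall)  # zero when recall is flat
        prev_recall = recall
    return ap


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
print(average_precision([0.9, 0.8, 0.7], [True, False, True], 2))  # 5/6, about 0.833
```

In the AP example, recall rises to 0.5 at rank 1 (precision 1.0) and to 1.0 at rank 3 (precision 2/3), giving 0.5 × 1.0 + 0.5 × 2/3 = 5/6.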