Abstract: Over the past few years, extensive research has been devoted to enhancing YOLO object detectors. Since its introduction, eight major versions of YOLO have been released with the purpose of improving its accuracy and efficiency. While the evident merits of YOLO have led to its extensive use in many areas, deploying it on resource-limited devices poses challenges. To address this issue, various neural network compression methods have been developed, which fall under three main categories, namely network pruning, quantization, and knowledge distillation. The fruitful outcomes of utilizing model compression methods, such as lowering memory usage and inference time, make them favorable, if not necessary, for deploying large neural networks on hardware-constrained edge devices. In this review paper, our focus is on pruning and quantization due to their comparative modularity. We categorize them and analyze the practical results of applying those methods to YOLOv5. By doing so, we identify gaps in adapting pruning and quantization for compressing YOLOv5, and provide future directions in this area for further exploration. Among several versions of YOLO, we specifically choose YOLOv5 for its excellent trade-off between recency and popularity in the literature. This is the first review paper that specifically surveys pruning and quantization methods from an implementation point of view on YOLOv5. Our study is also extendable to newer versions of YOLO, as implementing them on resource-limited devices poses the same challenges that persist even today. This paper targets those interested in the practical deployment of model compression methods on YOLOv5, and in exploring different compression techniques that can be used for subsequent versions of YOLO.

What problem does this paper attempt to address?

The main problem this paper addresses is the difficulty of deploying YOLOv5 on resource-constrained devices. Although the YOLO series of object detectors performs well in terms of accuracy and efficiency, the models' complexity and size make them difficult to deploy directly on edge devices such as mobile devices and embedded systems. To address this, the authors review neural network compression methods, focusing on pruning and quantization, that reduce the memory footprint and inference time of YOLOv5 so that it can run effectively on edge devices with limited hardware resources.

### Main Research Contents

1. **Pruning**:
   - **Definition**: Pruning removes redundant or unimportant parameters from a neural network to obtain a more compact model structure.
   - **Application**: The paper details different types of pruning techniques, including methods based on the ℓn-norm, feature map activation, the batch normalization scaling factor (BNSF), first-order derivatives, and mutual information. Depending on the granularity of pruning, these methods can be unstructured or structured.
   - **Results**: Pruning reduces the number of parameters, model size, floating-point operations (FLOPs), and inference time of YOLOv5 while preserving as much of the model's accuracy as possible.
2. **Quantization**:
   - **Definition**: Quantization represents the weights and activations of a model with low-precision data types (such as 8-bit integers) to reduce storage requirements and computational overhead.
   - **Application**: The paper discusses different quantization techniques, including post-training quantization and quantization-aware training. These techniques can reduce the model's memory footprint and inference latency with minimal impact on accuracy.
   - **Results**: The quantized YOLOv5 model runs inference faster on edge devices, with a smaller memory footprint and lower power consumption.

### Future Directions

The authors also point out the shortcomings of current pruning and quantization methods when adapted to YOLOv5 and propose future research directions, such as:

- Exploring more efficient pruning and quantization algorithms to further improve compression.
- Combining pruning and quantization with other compression techniques (such as knowledge distillation) to achieve better performance.
- Studying how to apply these compression methods to newer versions of YOLO (such as YOLOv6, YOLOv7, and YOLOv8) to meet a wider range of real-world requirements.

In summary, this paper aims to guide the efficient deployment of YOLOv5 on resource-constrained devices by reviewing existing pruning and quantization methods, and to point out directions for future model compression research.
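
The two techniques summarized above can be illustrated with a toy sketch. The code below is not taken from the paper and does not touch a real YOLOv5 model: the weights are made-up numbers, and the function names (`l1_filter_prune`, `quantize_int8`, `dequantize`) are hypothetical helpers chosen for illustration. It assumes one common form of each idea: structured pruning that ranks filters by their ℓ1-norm, followed by symmetric per-tensor post-training quantization to 8-bit integers.

```python
# Illustrative sketch only: toy weights, not real YOLOv5 layers.

def l1_filter_prune(filters, keep_ratio):
    """Structured pruning: keep the filters with the largest l1-norm.

    `filters` is a list of flat weight lists, one per output filter.
    Returns the sorted indices of the filters that are kept.
    """
    norms = [sum(abs(w) for w in f) for f in filters]
    n_keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])


def quantize_int8(weights):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Four hypothetical filters with three weights each.
filters = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [-0.5, 0.4, 0.3],
    [0.05, -0.03, 0.02],
]

# Step 1: drop the half of the filters with the smallest l1-norm.
kept = l1_filter_prune(filters, keep_ratio=0.5)
pruned = [filters[i] for i in kept]

# Step 2: quantize the surviving weights and measure the rounding error.
flat = [w for f in pruned for w in f]
q, scale = quantize_int8(flat)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(flat, restored))

print("kept filters:", kept)
print("int8 weights:", q)
print("max round-trip error:", max_error)
```

In this toy case the two low-magnitude filters are removed and each remaining weight is stored as a single integer plus one shared scale. In practice, a pruned detector is usually fine-tuned afterwards to recover accuracy, and quantization frameworks also calibrate activation ranges, neither of which this sketch attempts.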

Model Compression Methods for YOLOv5: A Review

Quantizing YOLOv7: A Comprehensive Study

YOLOv1 to v8: Unveiling Each Variant–A Comprehensive Review of YOLO

YOLOv5, YOLOv8 and YOLOv10: The Go-To Detectors for Real-time Vision

A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS

Evaluating the Evolution of YOLO (You Only Look Once) Models: A Comprehensive Benchmark Study of YOLO11 and Its Predecessors

Compressing YOLO Network by Compressive Sensing

Overview of Research on Object Detection Based on YOLO

Yolo Versions Architecture: Review

A review of the development of YOLO object detection algorithm

Pruning at a Glance: Global Neural Pruning for Model Compression

A survey of model compression strategies for object detection

A Comprehensive Review of YOLO: From YOLOv1 to YOLOv8 and Beyond

What is YOLOv5: A deep look into the internal features of the popular object detector

Model Compression for Deep Neural Networks: A Survey

YOLOv10: Real-Time End-to-End Object Detection

Edge AI: Evaluation of Model Compression Techniques for Convolutional Neural Networks

What is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector

YOLO-based Object Detection Models: A Review and its Applications

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models