Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Aayush Saxena, Arit Kumar Bishwas, Ayush Ashok Mishra, Ryan Armstrong
2024-07-22
Abstract: Deep learning models have achieved tremendous success across most industries in recent years. The evolution of these models has also increased their size and energy requirements, making them difficult to deploy in production on low-compute devices. The growing number of connected devices around the world warrants compressed models that can be deployed easily on local devices with limited compute capacity and power availability. Researchers have proposed a wide range of solutions to reduce the size and complexity of such models; prominent among them are weight quantization, parameter pruning, network pruning, low-rank representation, weight sharing, neural architecture search, and knowledge distillation. In this work, we investigate the performance impact of quantization and pruning on various trained deep learning models. We applied both compression techniques to popular deep learning models used for image classification, object detection, language modeling, and generative modeling. We also explored the performance of various large language models (LLMs) after quantization and low-rank adaptation. We used standard evaluation metrics (model size, accuracy, and inference time) for all of these problem statements, and we conclude the paper by discussing challenges and future work.
Machine Learning
What problem does this paper attempt to address?
The paper primarily explores deep learning model compression techniques and their impact on the performance of different models. Specifically, the research objectives can be summarized as follows:

1. **Background and Challenges**:
   - Deep learning models have achieved great success in various industries, but their size and computational demands keep increasing, making them difficult to deploy on devices with limited computational resources.
   - Parameter counts are growing rapidly: natural language processing models have gone from billions to hundreds of billions or even trillions of parameters, and commonly used computer vision models require billions of floating-point operations (FLOPs). This makes the need to compress these models more urgent.
2. **Research Objectives**:
   - Evaluate the impact of model compression techniques (such as quantization and pruning) on the performance of trained deep learning models.
   - Implement quantization and pruning and apply them to popular deep learning models for image classification, object detection, language modeling, and generative modeling.
   - Explore the performance of large language models (LLMs) after quantization and low-rank adaptation.
   - Use standard evaluation metrics (model size, accuracy, and inference time) to measure the performance of models before and after compression, and discuss the challenges and directions for future work.
3. **Specific Issues**:
   - The research aims to address how to effectively reduce the size and complexity of deep learning models so that they can be deployed on edge devices.
   - Through techniques like quantization and pruning, the researchers aim to reduce model size, lower computational cost, and improve inference speed while maintaining model performance.
In summary, the core objective of this paper is to study how to optimize and improve deep learning models through various compression methods, enabling them to run efficiently in resource-constrained environments, thereby broadening the application scenarios of these models.
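To make the two compression techniques named above more concrete, the following is a minimal NumPy sketch of symmetric per-tensor int8 quantization and unstructured magnitude pruning applied to a single random weight matrix. It is an illustration under stated assumptions, not the paper's implementation: the function names (`quantize_int8`, `dequantize`, `magnitude_prune`), the 256×256 matrix, and the 50% sparsity level are all hypothetical choices made for this example. It reports only storage size, reconstruction error, and sparsity, which loosely mirror the "model size" and "accuracy" metrics mentioned in the summary.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor quantization of float weights to int8 (illustrative)."""
    max_abs = float(np.max(np.abs(w)))
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover an approximate float32 tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale


def magnitude_prune(w, sparsity):
    """Zero out (roughly) the `sparsity` fraction of smallest-magnitude weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # The k-th smallest absolute value acts as the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w).astype(w.dtype)


# Hypothetical stand-in for one layer's weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
pruned = magnitude_prune(w, 0.5)

print("float32 bytes:", w.nbytes, "int8 bytes:", q.nbytes)
print("max quantization error:", float(np.max(np.abs(w - w_hat))))
print("fraction of zeroed weights:", float(np.mean(pruned == 0)))
```

Storing int8 instead of float32 cuts weight storage by a factor of four, at the cost of a rounding error bounded by about half the quantization step. Pruning by itself does not shrink a dense array; the savings only materialize with a sparse storage format or hardware and kernels that skip zeros, which is one reason measured inference time needs to be evaluated separately from model size.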