Abstract: Deep learning models have achieved tremendous success in most industries in recent years. The evolution of these models has also increased their size and energy requirements, making them difficult to deploy in production on low-compute devices. The growing number of connected devices around the world warrants compressed models that can be easily deployed on local devices with limited compute capacity and power availability. Researchers have proposed a wide range of solutions to reduce the size and complexity of such models; prominent among them are weight quantization, parameter pruning, network pruning, low-rank representation, weight sharing, neural architecture search, and knowledge distillation. In this research work, we investigate the performance impact of quantization and pruning on various trained deep learning models. We implemented both compression techniques on popular deep learning models used for image classification, object detection, language modeling, and generative modeling. We also explored the performance of various large language models (LLMs) after quantization and low-rank adaptation. We used standard evaluation metrics (model size, accuracy, and inference time) for all of these problem statements, and we conclude the paper by discussing the challenges and future work.

What problem does this paper attempt to address?

The paper primarily explores deep learning model compression techniques and their impact on the performance of different models. Specifically, the research objectives can be summarized as follows:

1. **Background and Challenges**:
   - Deep learning models have achieved great success in various industries, but their size and computational demands are also increasing, making them difficult to deploy on devices with limited computational resources.
   - Model sizes keep growing: natural language processing models have gone from billions of parameters to hundreds of billions or even trillions, and commonly used computer vision models require billions of floating-point operations (FLOPs). This makes the need to compress these models more urgent.
2. **Research Objectives**:
   - Evaluate the impact of various model compression techniques (such as quantization and pruning) on the performance of trained deep learning models.
   - Implement quantization and pruning, and apply them to popular deep learning models in fields such as image classification, object detection, language models, and generative models.
   - Explore the performance of large language models (LLMs) after quantization and low-rank adaptation.
   - Use standard evaluation metrics (model size, accuracy, and inference time) to measure the performance of models before and after compression, and discuss the challenges and future work directions.
3. **Specific Issues**:
   - The research aims to address how to effectively reduce the size and complexity of deep learning models so that they can be deployed on edge devices.
   - Through techniques like quantization and pruning, the researchers hope to reduce model size, lower computational costs, and improve inference speed while maintaining model performance.

In summary, the core objective of this paper is to study how various compression methods can make deep learning models run efficiently in resource-constrained environments, thereby broadening the application scenarios of these models.
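To make the two main techniques concrete, here is a minimal pure-Python sketch of symmetric int8 weight quantization and magnitude-based pruning applied to a toy weight vector. It is an illustration under stated assumptions, not the paper's implementation: the function names (`quantize_int8`, `dequantize`, `magnitude_prune`), the per-tensor scaling scheme, the 50% sparsity level, and the random weights are all hypothetical choices made for this example.

```python
import random


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer levels."""
    return [v * scale for v in q]


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first k are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Toy "layer": 1000 weights drawn from a narrow Gaussian (assumed for illustration).
random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(1000)]

# Quantization: 4 bytes per fp32 weight vs. 1 byte per int8 weight plus one fp32 scale.
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(q) + 4

# Pruning: half of the weights become exact zeros.
pruned = magnitude_prune(weights, sparsity=0.5)
zeros = sum(1 for w in pruned if w == 0.0)

print(f"fp32 size: {fp32_bytes} B, int8 size: {int8_bytes} B")
print(f"max quantization error: {max_err:.6f} (scale = {scale:.6f})")
print(f"pruned weights set to zero: {zeros} of {len(pruned)}")
```

The sketch covers only the model-size metric. Accuracy and inference time, the other two metrics named above, require a real model and dataset, and pruned weights save storage or compute only when the zeros are stored or executed in a sparse format.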

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

A Survey on Model Compression for Large Language Models

Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models

Deep learning model compression using network sensitivity and gradients

Aggressive Post-Training Compression on Extremely Large Language Models

Model Compression and Efficient Inference for Large Language Models: A Survey

The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models

Model Compression for Deep Neural Networks: A Survey

Compression of Deep Learning Models for Text: A Survey

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

A Comprehensive Study on Quantization Techniques for Large Language Models

What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models

Deep Learning Model Compression with Rank Reduction in Tensor Decomposition.

Model Compression by Iterative Pruning with Knowledge Distillation and Its Application to Speech Enhancement

Deep Learning Model Compression Techniques: Advances, Opportunities, and Perspective

Evaluating Large Language Models for Generalization and Robustness via Data Compression

Effective Interplay between Sparsity and Quantization: From Theory to Practice

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference

A Novel Deep Learning Model Compression Algorithm

On Compressing Deep Models by Low Rank and Sparse Decomposition.