An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs

Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno
2024-07-19
Abstract: The LLaMA family has become one of the most powerful families of open-source Large Language Models (LLMs) and a popular LLM backbone for Multimodal Large Language Models (MLLMs), widely applied in Computer Vision (CV) and Natural Language Understanding (NLU) tasks. Notably, LLaMA3 models have recently been released and achieve impressive performance across various tasks with super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration can potentially unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation encountered in LLM compression. Specifically, we comprehensively evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and on diverse datasets to reveal LLaMA3's low-bit quantization performance. To uncover the capabilities of low-bit quantized MLLMs, we assess the performance of the LLaMA3-based LLaVA-Next-8B model at ultra-low bit-widths of 2-4 bits with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in linguistic and visual contexts, particularly at ultra-low bit-widths. This highlights a significant performance gap at low bit-widths that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, driving LLMs and MLLMs to achieve higher accuracy at lower bit-widths and thereby enhancing their practicality.
Machine Learning
What problem does this paper attempt to address?
This paper evaluates and analyzes how the LLaMA3 large language model (LLM) and its multimodal extension (MLLM) perform under low-bit quantization. Specifically, the main research questions include:

1. **Impact of quantization on performance**: Explore how low-bit quantization (1-8 bits) affects LLaMA3 on language tasks and vision-language tasks, especially the degradation at extremely low bit-widths (2 bits or fewer).
2. **Applicability of existing quantization methods**: Evaluate the effectiveness and limitations of multiple existing quantization methods (such as post-training quantization and LoRA-finetuning) when applied to LLaMA3, and reveal how these methods differ in performance on the latest generation of LLMs.
3. **Challenges of performance degradation**: Identify and analyze the specific causes of performance degradation during quantization, especially for a model pre-trained at the scale of LLaMA3, and how to maintain its high performance in resource-constrained scenarios.
4. **Characteristics of multimodal models**: Study how LLaMA3-based multimodal models (such as LLaVA-Next-8B) perform under low-bit quantization, and explore the performance changes on tasks such as visual question answering.

Through these studies, the paper aims to provide a useful reference for future quantization techniques for LLMs and MLLMs, helping these models reduce computational and memory requirements while maintaining high accuracy, thereby enhancing their practicality.

### Formula Examples

Some quantization methods involved in the paper can be represented by the following formulas:

- **Round-To-Nearest (RTN)**: \[ q=\text{round}\left(\frac{x}{s}\right) \] where \(x\) is the original weight, \(s\) is the scaling factor, and \(q\) is the quantized value.
- **GPTQ** (a post-training quantization method for generative pre-trained transformers): \[ Q^{*} = \arg\min_{Q\in\mathcal{Q}}\left\|W - S\cdot Q\right\|_F^2 \] where \(W\) is the original weight matrix, \(S\) is the scaling factor matrix, \(Q\) is the quantized weight matrix, \(\mathcal{Q}\) is the quantization domain, and \(Q^{*}\) is the selected quantized matrix. This is a simplified weight-space form; GPTQ itself measures the error on each layer's outputs over calibration inputs rather than on the weights alone.

These formulas capture the core objective of quantization: keeping the quantized model's behavior as close as possible to that of the original model.
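The RTN formula and the weight-space error above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `rtn_quantize` and `reconstruction_error` are hypothetical, the scale is chosen symmetrically per row as \(s = \max|x| / (2^{b-1}-1)\), and the weights are random rather than taken from LLaMA3. It is not an implementation of GPTQ or of the paper's evaluation pipeline.

```python
import numpy as np


def rtn_quantize(w: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest quantization with one scale per row.

    Assumption: s = max|x| / (2^(bits-1) - 1), which is one common choice
    and is not specified in the text above.
    """
    if bits < 2:
        raise ValueError("this symmetric sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # Guard against all-zero rows so the division below is well defined.
    s = np.where(max_abs > 0, max_abs / qmax, 1.0)
    # q = round(x / s), clipped to the representable integer range.
    q = np.clip(np.round(w / s), -qmax, qmax).astype(np.int8)
    return q, s


def reconstruction_error(w: np.ndarray, q: np.ndarray, s: np.ndarray) -> float:
    """Squared Frobenius norm ||W - S * Q||_F^2 with row-wise scales."""
    return float(np.linalg.norm(w - s * q, ord="fro") ** 2)


# Random stand-in for a weight matrix (not real model weights).
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)

errors = {}
for bits in (8, 4, 3, 2):
    q, s = rtn_quantize(w, bits)
    errors[bits] = reconstruction_error(w, q, s)
    print(f"{bits}-bit RTN: squared Frobenius error = {errors[bits]:.4f}")
```

On such random weights the error grows as the bit-width shrinks, which gives a toy-scale view of why ultra-low bit-widths are the hardest regime; it says nothing quantitative about LLaMA3 itself.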