Abstract: The LLaMA family has become one of the most powerful open-source Large Language Models (LLMs) and a popular LLM backbone for Multimodal Large Language Models (MLLMs), widely applied in Computer Vision (CV) and Natural Language Understanding (NLU) tasks. Notably, LLaMA3 models have recently been released and achieve impressive performance across various tasks, with super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration can potentially unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation that arises in LLM compression. Specifically, we comprehensively evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and on diverse datasets to reveal LLaMA3's low-bit quantization performance. To uncover the capabilities of low-bit quantized MLLMs, we assess the performance of the LLaMA3-based LLaVA-Next-8B model at ultra-low bit-widths of 2-4 bits with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in linguistic and visual contexts, particularly at ultra-low bit-widths. This highlights a significant performance gap at low bit-widths that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, driving LLMs and MLLMs to achieve higher accuracy at lower bit-widths and thereby enhancing their practicality.

What problem does this paper attempt to address?

This paper evaluates and analyzes how the LLaMA3 large language model (LLM) and its multimodal extension (MLLM) perform under low-bit quantization. The main research questions are:

1. **Impact of quantization on performance**: Explore how low-bit quantization (1-8 bits) affects LLaMA3 on language tasks and vision-language tasks, especially the degradation at extremely low bit-widths (2 bits or fewer).
2. **Applicability of existing quantization methods**: Evaluate the effectiveness and limitations of multiple existing quantization methods (such as post-training quantization and LoRA-finetuning) when applied to LLaMA3, and reveal how these methods differ in performance on the latest generation of LLMs.
3. **Challenges of performance degradation**: Identify and analyze the specific reasons for performance degradation during quantization, especially for the extremely large-scale pre-trained LLaMA3 model, and how to maintain its high performance in resource-constrained scenarios.
4. **Characteristics of multimodal models**: Study the performance of LLaMA3-based multimodal models (such as LLaVA-Next-8B) under low-bit quantization, and examine how performance changes in tasks such as visual question answering.

Through these studies, the paper aims to provide a useful reference for future LLM and MLLM quantization techniques, helping these models reduce computational and memory requirements while maintaining high accuracy, thereby enhancing their practicality.

### Formula Examples

Some quantization methods involved in the paper can be represented by the following formulas:

- **Round-To-Nearest (RTN)**: \[ q = \text{round}\left(\frac{x}{s}\right) \] where \(x\) is the original weight, \(s\) is the scaling factor, and \(q\) is the quantized value.
- **GPTQ** (a post-training quantization method): \[ \hat{Q} = \arg\min_{Q\in\mathcal{Q}}\left\|W - S\cdot Q\right\|_F^2 \] where \(W\) is the original weight matrix, \(S\) is the scaling factor matrix, \(Q\) is the quantized weight matrix, \(\hat{Q}\) is the selected quantized matrix, and \(\mathcal{Q}\) is the quantization domain. This is a simplified weight-space form; GPTQ itself measures the reconstruction error on layer outputs over calibration data.

These formulas show the key steps in the quantization process, which aim to keep the quantized model's behavior as close as possible to that of the original model.
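As a rough illustration of the RTN formula above, the following sketch quantizes a small list of weights and measures the squared reconstruction error, the weight-space quantity that the second formula minimizes. It is not the paper's implementation: the function names (`rtn_quantize`, `dequantize`, `squared_error`), the symmetric signed integer range, and the single per-tensor scale are all assumptions chosen for brevity. Real methods typically use per-channel or per-group scales.

```python
# Minimal round-to-nearest (RTN) sketch using only the standard library.
# Assumptions (not from the paper): symmetric signed range, one scale per tensor.

def rtn_quantize(weights, bits):
    """Return (q, scale) where q[i] = clamp(round(weights[i] / scale))."""
    if bits < 2:
        raise ValueError("this symmetric sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1      # largest code, e.g. 7 for 4 bits
    qmin = -qmax - 1                # smallest code, e.g. -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto qmax; avoid dividing by zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Note: Python's round() rounds exact halves to the nearest even integer.
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [scale * v for v in q]


def squared_error(weights, bits):
    """Sum of squared differences between weights and their RTN reconstruction."""
    q, scale = rtn_quantize(weights, bits)
    restored = dequantize(q, scale)
    return sum((w - r) ** 2 for w, r in zip(weights, restored))


if __name__ == "__main__":
    q, scale = rtn_quantize([1.75, -0.5, 0.25, 0.0], bits=4)
    print(q, scale)  # [7, -2, 1, 0] 0.25

    sample = [0.9, -0.4, 0.33, 0.05, -0.77]
    for bits in (2, 3, 4, 8):
        print(bits, squared_error(sample, bits))
```

On the sample weights the error at 2 bits is far larger than at 8 bits, which mirrors in miniature the degradation at ultra-low bit-widths that the study reports for full models.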

An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs
