Abstract: Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Code and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.
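To make the reparameterization described in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a simple greedy residual binarization, one scaling factor per row per bit (a row plays the role of a "group"), and scaling factors rounded to powers of two so that scaling becomes an exponent shift. The function names `binary_quantize` and `shift_add_matvec` are hypothetical. The sketch omits the table queries, the multi-objective optimization, and the automated bit allocation mentioned in the abstract, and it applies the shift after accumulation, which is mathematically equivalent for per-row factors.

```python
import numpy as np

def binary_quantize(W, bits=3):
    """Approximate W as sum_i 2**e_i * B_i with B_i in {-1, +1}.

    Assumption: greedy residual binarization with one power-of-two
    scaling factor per row per bit (not the paper's optimizer).
    """
    residual = np.asarray(W, dtype=np.float64).copy()
    B_list, exp_list = [], []
    for _ in range(bits):
        # Binary matrix: the sign pattern of the current residual.
        B = np.where(residual >= 0, 1.0, -1.0)
        # Per-row scale: mean absolute residual (guarded against zero).
        alpha = np.maximum(np.abs(residual).mean(axis=1, keepdims=True), 1e-12)
        # Round the scale to the nearest power of two in the log domain.
        exps = np.round(np.log2(alpha)).astype(np.int32)
        # Remove what this bit explains; np.ldexp(1.0, e) equals 2**e.
        residual -= np.ldexp(1.0, exps) * B
        B_list.append(B)
        exp_list.append(exps)
    return B_list, exp_list

def shift_add_matvec(B_list, exp_list, x):
    """Compute y ~= W @ x using only signed adds and power-of-two scaling."""
    y = np.zeros(B_list[0].shape[0])
    for B, exps in zip(B_list, exp_list):
        # Adds only: add x_j where B is +1, subtract it where B is -1.
        acc = np.where(B > 0, x, -x).sum(axis=1)
        # Scaling by 2**e is an exponent shift, expressed here with ldexp.
        y += np.ldexp(acc, exps[:, 0])
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 8))
    x = rng.normal(size=8)
    B_list, exp_list = binary_quantize(W, bits=3)
    print("dense      :", W @ x)
    print("shift + add:", shift_add_matvec(B_list, exp_list, x))
```

Under these assumptions, the result of `shift_add_matvec` matches a dense product with the dequantized weights exactly; the gap to `W @ x` is the quantization error that the paper's optimization is designed to reduce.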

Nanolm: an Affordable LLM Pre-training Benchmark Via Accurate Loss Prediction Across Scales

Scaling Laws for Predicting Downstream Performance in LLMs

Temporal Scaling Law for Large Language Models

Language models scale reliably with over-training and on downstream tasks

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

FLM-101B: An Open LLM and How to Train It with $100K Budget

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Selecting Large Language Model to Fine-tune via Rectified Scaling Law

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

Predicting Emergent Capabilities by Finetuning

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees

LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments