Abstract: With the development of transformer-based large language models (LLMs), they have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment. Fortunately, some methods such as pruning or constructing a mixture of experts (MoE) aim at exploiting sparsity in transformer feedforward (FF) blocks to gain boosts in speed and reduction in memory requirements. However, these techniques can be very costly and inflexible in practice, as they often require training or are restricted to specific types of architectures. To address this, we introduce GRIFFIN, a novel training-free and calibration-free method that selects unique FF experts at the sequence level for efficient generation across a plethora of LLMs with different non-ReLU activation functions. This is possible due to a critical observation that many trained LLMs naturally produce highly structured FF activation patterns within a sequence, which we call flocking. Despite our method's simplicity, we show with 50% of the FF parameters, GRIFFIN maintains the original model's performance with little to no degradation on a variety of classification and generation tasks, all while improving latency (e.g. 1.29$\times$ and 1.25$\times$ speed-ups in Gemma 7B and Llama 2 13B, respectively, on an NVIDIA L40). Code is available at https://github.com/hdong920/GRIFFIN.

What problem does this paper attempt to address?

This paper addresses the high computational cost and memory requirements of deploying large language models (LLMs). Transformer-based LLMs are widely used across many fields because of their utility, but serving them demands substantial compute and memory. Much of that cost sits in the feedforward (FF) blocks, where many computations contribute little or nothing to the output. For example, in OPT-175B more than 95% of the neuron values in the FF block are zero for each token, so most of the computation is spent on features that have no effect.

To address this, the paper proposes GRIFFIN, a method that requires neither training nor calibration. It selects FF experts at the sequence level by exploiting a phenomenon the authors call "flocking": in many trained LLMs, the FF activation patterns within a sequence are highly structured. The core observation is that, for a given sequence, the relative activation intensities are shared across tokens, so the neurons that matter for the prompt are also the neurons that matter during generation. This makes it possible to drop a large share of the FF parameters without sacrificing performance, which in turn improves inference speed.

The specific contributions of GRIFFIN include:

1. **No Preparation Required**: GRIFFIN is a low-cost method that needs no additional training or calibration and can be integrated easily into existing FF blocks.
2. **Simple Expert Selection**: By analyzing flocking in the prompt, GRIFFIN determines the most relevant FF neurons to use during generation, with almost no performance loss.
3. **Model and Activation Function Diversity**: Experiments show that GRIFFIN applies to multiple models, including Llama 2, Gemma, Mistral, OPT, and ReluLlama, and supports multiple activation functions such as ReLU, SwiGLU, GEGLU, and ReGLU.

With these properties, GRIFFIN maintains the original model's performance after removing 50% of the FF neurons while also reducing latency. For example, it achieves 1.25× and 1.29× speed-ups on Llama 2 13B and Gemma 7B, respectively, on an NVIDIA L40. In addition, GRIFFIN demonstrates good scalability and robustness.
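
The prompt-based expert selection described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation. The summary does not state the scoring rule, so the one used here is an assumption: normalize each prompt token's FF activation vector to unit length, score each neuron by its aggregate magnitude across tokens, and keep the top k. The SiLU-gated FF block and the names `select_experts`, `prune_ff`, and `ff_forward` are also hypothetical choices made for the example.

```python
import math

def silu(x):
    # SiLU activation, as used in SwiGLU-style gated FF blocks (assumed here).
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # Multiply a matrix stored as a list of rows by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ff_activations(x, W_gate, W_up):
    # Hidden activations of a gated FF block: z = silu(W_gate x) * (W_up x).
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def ff_forward(x, W_gate, W_up, W_down):
    # Full FF block output: W_down z.
    return matvec(W_down, ff_activations(x, W_gate, W_up))

def select_experts(prompt, W_gate, W_up, k):
    # Assumed scoring rule: unit-normalize each token's activation vector,
    # then accumulate squared relative activations per neuron over the prompt.
    # Ranking by the sum of squares equals ranking by the l2 norm.
    d_ff = len(W_gate)
    scores = [0.0] * d_ff
    for x in prompt:
        z = ff_activations(x, W_gate, W_up)
        norm = math.sqrt(sum(v * v for v in z)) or 1.0  # guard all-zero rows
        for j, v in enumerate(z):
            scores[j] += (v / norm) ** 2
    ranked = sorted(range(d_ff), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def prune_ff(W_gate, W_up, W_down, keep):
    # Keep the selected neurons: rows of W_gate and W_up, columns of W_down.
    return (
        [W_gate[j] for j in keep],
        [W_up[j] for j in keep],
        [[row[j] for j in keep] for row in W_down],
    )
```

In use, `select_experts` would run once per layer on the prompt's inputs to that layer, and the smaller matrices returned by `prune_ff` would replace the full FF weights for every generated token of that sequence. Setting `k` to half the FF width corresponds to the 50% setting reported above.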

Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

LLM-Pruner: On the Structural Pruning of Large Language Models

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

Fluctuation-based Adaptive Structured Pruning for Large Language Models

MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information

Less is More: Towards Green Code Large Language Models via Unified Structural Pruning

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

PAT: Pruning-Aware Tuning for Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Thinking Forward: Memory-Efficient Federated Finetuning of Language Models

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment