Abstract: With the development of transformer-based large language models (LLMs), they have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment. Fortunately, some methods such as pruning or constructing a mixture of experts (MoE) aim at exploiting sparsity in transformer feedforward (FF) blocks to gain boosts in speed and reduction in memory requirements. However, these techniques can be very costly and inflexible in practice, as they often require training or are restricted to specific types of architectures. To address this, we introduce GRIFFIN, a novel training-free and calibration-free method that selects unique FF experts at the sequence level for efficient generation across a plethora of LLMs with different non-ReLU activation functions. This is possible due to a critical observation that many trained LLMs naturally produce highly structured FF activation patterns within a sequence, which we call flocking. Despite our method's simplicity, we show that with 50% of the FF parameters, GRIFFIN maintains the original model's performance with little to no degradation on a variety of classification and generation tasks, all while improving latency (e.g. 1.29$\times$ and 1.25$\times$ speed-ups in Gemma 7B and Llama 2 13B, respectively, on an NVIDIA L40). Code is available at https://github.com/hdong920/GRIFFIN.
What problem does this paper attempt to address?
This paper addresses the high computational cost and memory requirements of deploying large language models (LLMs). Although Transformer-based LLMs are widely used across many fields because of their utility, deploying them requires substantial compute and memory. In particular, the feedforward (FF) blocks perform many computations that have little or no effect on the output, which wastes resources. For example, in OPT-175B, more than 95% of the FF neuron values for each token are zero, so most of the FF computation is spent on features that contribute nothing.
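To make the sparsity claim concrete, the sketch below counts the fraction of exactly-zero neurons in a ReLU FF layer for a single token. The weights and input are hypothetical toy values, and the helper names (`ff_activations`, `zero_fraction`) are illustrative and not taken from the paper or its code.

```python
def relu(a):
    """ReLU: negative pre-activations become exactly zero."""
    return a if a > 0.0 else 0.0

def ff_activations(W1, x):
    """Return relu(W1 @ x) for one token; W1 is a list of rows, one per FF neuron."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def zero_fraction(activations):
    """Fraction of FF neurons whose activation is exactly zero for this token."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

# Hypothetical toy layer: 4 neurons, 2 input dimensions.
W1 = [[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]]
x = [2.0, 1.0]

z = ff_activations(W1, x)
print(z)                 # [2.0, 0.0, 0.0, 0.0]
print(zero_fraction(z))  # 0.75
```

Every zero entry in `z` corresponds to a row of the first FF matrix and a column of the second whose multiplications could have been skipped for this token.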
To address this, the paper proposes GRIFFIN, a method that requires neither training nor calibration. It selects FF experts at the sequence level by exploiting a phenomenon the authors call "flocking": in many trained LLMs, the FF activation patterns within a sequence are naturally highly structured. The core observation is that, within a given sequence, the relative activation magnitudes of FF neurons are shared across tokens. This makes it possible to reduce the number of FF parameters used during generation with little to no loss in performance while improving inference speed.
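One way to turn this observation into a selection rule is sketched below. It assumes a specific scoring statistic that the summary does not spell out: each token's activation vector is scaled to unit L2 norm (so only relative magnitudes matter), each neuron is scored by the L2 norm of its column across the prompt tokens, and the top `k` neurons are kept. The function names and the toy activation matrix are hypothetical.

```python
import math

def row_normalize(Z):
    """Scale each token's activation vector to unit L2 norm (relative magnitudes)."""
    out = []
    for row in Z:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm > 0 else 0.0 for v in row])
    return out

def neuron_scores(Z):
    """Assumed statistic: column-wise L2 norm of the row-normalized activations."""
    Zn = row_normalize(Z)
    n_neurons = len(Z[0])
    return [math.sqrt(sum(row[j] ** 2 for row in Zn)) for j in range(n_neurons)]

def select_experts(Z, k):
    """Indices of the k highest-scoring neurons, in ascending index order."""
    s = neuron_scores(Z)
    top = sorted(range(len(s)), key=lambda j: s[j], reverse=True)[:k]
    return sorted(top)

# Hypothetical prompt activations: 3 tokens x 4 neurons. Absolute magnitudes
# differ a lot between tokens, but neurons 0 and 2 dominate in every row.
Z_prompt = [
    [4.0, 0.1, 3.0, 0.2],
    [0.8, 0.05, 0.6, 0.0],
    [10.0, 1.0, 9.0, 0.5],
]

experts = select_experts(Z_prompt, k=2)
print(experts)  # [0, 2]
```

Normalizing each row first keeps a single high-magnitude token from dominating the scores, which matches the idea that what tokens share is the relative pattern, not the absolute scale.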
The specific contributions of GRIFFIN include:
1. **No Preparation Required**: GRIFFIN is a low-cost method that requires no additional training or calibration and can be easily integrated into existing FF blocks.
2. **Simple Expert Selection**: By analyzing flocking in the prompt, GRIFFIN determines the most relevant FF neurons to use during generation, with almost no performance loss.
3. **Model and Activation Function Diversity**: Experiments show that GRIFFIN applies to multiple models, including Llama 2, Gemma, Mistral, OPT, and ReluLlama, and supports multiple activation functions such as ReLU, SwiGLU, GEGLU, and ReGLU.
With this approach, GRIFFIN maintains the original model's performance with little to no degradation after removing 50% of the FF neurons, while also reducing latency. For example, it achieves 1.25× and 1.29× speed-ups on Llama 2 13B and Gemma 7B, respectively, on an NVIDIA L40. In addition, GRIFFIN demonstrates excellent scalability and robustness.
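Removing FF neurons amounts to slicing the two FF weight matrices, which can be sketched as follows. The example assumes a plain (non-gated) ReLU FF block of the form `W2 @ relu(W1 @ x)`; gated variants such as SwiGLU would need the gate matrix sliced along the same neuron indices. All weights and the selected index set `experts` are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu_vec(v):
    return [max(0.0, a) for a in v]

def ff(W1, W2, x):
    """FF block: W2 @ relu(W1 @ x). W1 is (neurons x d), W2 is (d x neurons)."""
    return matvec(W2, relu_vec(matvec(W1, x)))

def prune(W1, W2, experts):
    """Keep only the rows of W1 and the columns of W2 indexed by `experts`."""
    W1_small = [W1[j] for j in experts]
    W2_small = [[row[j] for j in experts] for row in W2]
    return W1_small, W2_small

# Hypothetical toy block: 4 neurons, model dimension 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
W2 = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, -1.0, 1.0]]
x = [1.0, 2.0]
experts = [0, 2]  # keep 50% of the neurons

W1_s, W2_s = prune(W1, W2, experts)
y_pruned = ff(W1_s, W2_s, x)
print(y_pruned)  # [10.0, -2.5]
```

The pruned block gives the same output as the full block with the unselected neurons zeroed out, but it uses half-size matrices, which is where the memory and latency savings would come from.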