Abstract: Deploying local AI models, such as Large Language Models (LLMs), to edge devices can substantially enhance the devices' independent capabilities, alleviate the server's burden, and lower response time. Owing to this potential, many big tech companies have released lightweight Small Language Models (SLMs) to bridge the gap. However, there is still strong motivation to deploy more powerful AI models (LLMs) on edge devices and raise their level of smartness. Unlike conventional approaches to AI model compression, we investigate activation sparsity. The activation sparsity method is orthogonal to existing techniques and can be combined with them to maximize the compression rate while maintaining high accuracy. Because LLMs' Feed-Forward Network (FFN) components typically hold a large proportion of the parameters (around 2/3), FFN optimizations have a better chance of achieving effective compression. Moreover, our findings benefit general LLMs and are not restricted to ReLU-based models. This work systematically investigates the tradeoff between enforced activation sparsity and perplexity (accuracy) on state-of-the-art LLMs. Our empirical analysis demonstrates that we can obtain around 50% reductions in main memory and computation for the critical FFN components with negligible accuracy degradation. This extra 50% sparsity does not exist naturally in current LLMs; obtaining it requires tuning the LLMs' activation outputs by injecting zero-enforcing thresholds. To obtain the benefits of activation sparsity, we provide a guideline for system architects on LLM activation prediction and prefetching. Successful prediction allows the system to prefetch the necessary weights while omitting the inactive ones and their successors, thereby lowering cache and memory pollution and reducing LLM execution time on resource-constrained edge devices.
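The zero-enforcing threshold mentioned in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the function name `thresholded_swiglu_ffn`, the toy dimensions, the random weights, the choice to threshold the magnitude of the gated intermediate activation, and the use of the median magnitude as the threshold so that about half of the neurons are zeroed.

```python
import numpy as np

def silu(x):
    """SiLU (swish) activation used inside SwiGLU."""
    return x / (1.0 + np.exp(-x))

def thresholded_swiglu_ffn(x, w_gate, w_up, w_down, threshold):
    """SwiGLU FFN that zeroes intermediate activations with |a| < threshold.

    Returns the FFN output and the boolean mask of neurons kept active.
    """
    act = silu(x @ w_gate) * (x @ w_up)        # intermediate activations, shape (d_ff,)
    mask = np.abs(act) >= threshold            # True where the neuron stays active
    sparse_act = np.where(mask, act, 0.0)      # enforce zeros below the threshold
    return sparse_act @ w_down, mask

# Toy sizes and random weights (hypothetical, for illustration only).
rng = np.random.default_rng(0)
d_model, d_ff = 64, 172
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
w_up = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
w_down = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

# With a threshold of 0 nothing is zeroed: this is the dense baseline.
dense_out, dense_mask = thresholded_swiglu_ffn(x, w_gate, w_up, w_down, 0.0)

# Assumed calibration: use the median magnitude so roughly half the neurons are zeroed.
tau = np.median(np.abs(silu(x @ w_gate) * (x @ w_up)))
sparse_out, mask = thresholded_swiglu_ffn(x, w_gate, w_up, w_down, tau)

sparsity = 1.0 - mask.mean()
rel_error = np.linalg.norm(dense_out - sparse_out) / np.linalg.norm(dense_out)
print(f"enforced sparsity: {sparsity:.2f}, relative output error: {rel_error:.3f}")
```

On random weights the output error says nothing about a real model; the accuracy cost the abstract refers to has to be measured as perplexity on a trained LLM.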

### What problem does this paper attempt to address?

This paper explores how to compress large language models (LLMs) through activation sparsity so that more powerful AI models can be deployed on resource-constrained edge devices. Specifically, it addresses the following key issues:

1. **Deployment of large language models on edge devices**:
   - Current LLMs are difficult to deploy directly on edge devices because of their high computational and memory requirements. Although some companies have released small language models (SLMs), the application scope of these models is limited.
   - The paper proposes a new compression method: reducing the model's memory footprint and computational requirements through activation sparsity, so that larger language models can run on edge devices.
2. **Limitations of existing compression techniques**:
   - Traditional model compression methods such as pruning, quantization, and knowledge distillation are effective, but they hit bottlenecks beyond a certain compression rate.
   - The paper explores activation sparsity as a new compression approach that can be combined with existing techniques to further improve the compression rate.
3. **Research on activation sparsity**:
   - Activation sparsity means that the outputs of some neurons in the network are zero, so the computation and memory accesses tied to those neurons can be skipped.
   - The paper finds that current state-of-the-art LLMs (such as models with the SwiGLU activation function) have low natural activation sparsity, so activation sparsity has to be enforced by introducing a threshold.
4. **Prediction of activation patterns**:
   - To fully exploit the benefits of activation sparsity, the paper also studies the predictability of activation patterns. By predicting which neurons will be activated, the system can load only the necessary weights during inference, thereby reducing memory pollution and inference latency.

### Main contributions

- **Exploration of activation sparsity**: Analyzed the activation sparsity and weight sparsity of the latest LLMs, especially models with low natural activation sparsity.
- **Empirical analysis**: Verified experimentally that about 50% activation sparsity can be obtained in the FFN layers while maintaining acceptable accuracy.
- **Activation pattern prediction**: Studied the similarity and predictability of activation patterns, and proposed a method that predicts activation patterns from input tokens to optimize the use of memory and computational resources.

Through these studies, the paper provides guidance for system architects designing more efficient LLM inference systems, especially on resource-constrained edge devices.
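To make the payoff of activation prediction concrete, the sketch below shows how a known set of active neurons lets an FFN touch only the matching weight columns and rows. It is a minimal illustration under stated assumptions, not the paper's method: `ffn_active_only` is a hypothetical helper, the weights are random, and the "prediction" is an oracle computed from the true activations, standing in for the token-based predictor that the summary mentions but does not specify.

```python
import numpy as np

def silu(x):
    """SiLU (swish) activation used inside SwiGLU."""
    return x / (1.0 + np.exp(-x))

def ffn_active_only(x, w_gate, w_up, w_down, active_idx):
    """SwiGLU FFN that reads only the weights of the neurons in active_idx.

    Columns of w_gate / w_up and rows of w_down for inactive neurons are
    never touched, which is what a prefetcher would skip loading.
    """
    gate = silu(x @ w_gate[:, active_idx])
    up = x @ w_up[:, active_idx]
    return (gate * up) @ w_down[active_idx, :]

# Toy sizes and random weights (hypothetical, for illustration only).
rng = np.random.default_rng(1)
d_model, d_ff = 64, 172
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
w_up = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
w_down = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

# Oracle stand-in for a predictor: neurons whose magnitude clears the threshold.
act = silu(x @ w_gate) * (x @ w_up)
tau = np.median(np.abs(act))                   # assumed threshold, about 50% sparsity
active_idx = np.flatnonzero(np.abs(act) >= tau)

# Reference: dense computation followed by zeroing sub-threshold activations.
reference = np.where(np.abs(act) >= tau, act, 0.0) @ w_down
sparse = ffn_active_only(x, w_gate, w_up, w_down, active_idx)

fraction_loaded = active_idx.size / d_ff
print(f"weights loaded: {fraction_loaded:.0%}, outputs match: {np.allclose(reference, sparse)}")
```

With a perfect prediction the sparse path reproduces the thresholded dense result while reading about half of the FFN weights. A real predictor will sometimes miss active neurons, so its accuracy determines how much of this saving survives in practice.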

Activation Sparsity Opportunities for Compressing General Large Language Models
