Abstract: Deploying local AI models, such as Large Language Models (LLMs), to edge devices can substantially enhance the devices' independent capabilities, alleviate the server's burden, and reduce response time. Owing to this potential, many big tech companies have released lightweight Small Language Models (SLMs) to bridge this gap. However, there is still strong motivation to deploy more powerful AI models (LLMs) on edge devices and make those devices smarter. Unlike conventional approaches to AI model compression, we investigate activation sparsity. The activation sparsity method is orthogonal to existing techniques and can be combined with them to maximize the compression rate while maintaining high accuracy. Because Feed-Forward Network (FFN) components typically hold a large proportion of an LLM's parameters (around 2/3), optimizing the FFN offers a good chance of achieving effective compression. Moreover, our findings apply to general LLMs and are not restricted to ReLU-based models. This work systematically investigates the tradeoff between enforced activation sparsity and perplexity (accuracy) on state-of-the-art LLMs. Our empirical analysis demonstrates that we can obtain reductions of around 50% in main memory and computation for the critical FFN components with negligible accuracy degradation. This extra 50% sparsity does not exist naturally in current LLMs; obtaining it requires tuning the LLMs' activation outputs by injecting zero-enforcing thresholds. To obtain the benefits of activation sparsity, we provide a guideline for system architects on LLM activation prediction and prefetching. Successful prediction allows the system to prefetch the necessary weights while omitting the inactive ones and their successors, thereby lowering cache and memory pollution and reducing LLM execution time on resource-constrained edge devices.
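To make the idea of a zero-enforcing threshold concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes a simple two-layer (non-gated) FFN with a SiLU activation, chosen only as an example of a non-ReLU activation, and a single magnitude threshold. The names `sparse_ffn`, `W_up`, `W_down_T`, and `threshold` are hypothetical. Hidden activations whose magnitude does not exceed the threshold are treated as zero, so the matching rows of the down projection are never read, which is where the memory and compute savings would come from.

```python
import math
import random


def silu(x):
    """SiLU activation, x * sigmoid(x); used here as an example non-ReLU activation."""
    return x / (1.0 + math.exp(-x))


def sparse_ffn(x, W_up, W_down_T, threshold):
    """Illustrative FFN forward pass with a zero-enforcing threshold.

    x        : input vector of length d_model
    W_up     : d_ff rows of length d_model (up projection)
    W_down_T : d_ff rows of length d_model; row i holds the down-projection
               weights that hidden neuron i contributes to the output
    threshold: activations with |h| <= threshold are treated as zero

    Returns (output vector, fraction of hidden neurons that were skipped).
    """
    hidden = [silu(sum(w * v for w, v in zip(row, x))) for row in W_up]
    # Keep only neurons whose activation magnitude exceeds the threshold.
    active = [i for i, h in enumerate(hidden) if abs(h) > threshold]
    out = [0.0] * len(x)
    for i in active:
        # Only rows of W_down_T for active neurons are ever touched.
        h, row = hidden[i], W_down_T[i]
        for j in range(len(out)):
            out[j] += h * row[j]
    sparsity = 1.0 - len(active) / len(hidden)
    return out, sparsity


# Small demonstration with random weights (sizes are arbitrary assumptions).
random.seed(0)
d_model, d_ff = 8, 32
W_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_down_T = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
x = [random.gauss(0, 1) for _ in range(d_model)]

for t in (0.0, 0.1, 0.5):
    _, s = sparse_ffn(x, W_up, W_down_T, t)
    print(f"threshold={t:.1f}  enforced sparsity={s:.2f}")
```

In this sketch the threshold is a free parameter. Raising it increases sparsity at the cost of a larger deviation from the dense output, which mirrors the sparsity versus perplexity tradeoff the abstract describes. Note that the up projection is still computed densely here; skipping it as well would require predicting the active neurons in advance, as the prefetching guideline suggests.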

Attention is Naturally Sparse with Gaussian Distributed Input

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

Attention With Sparsity Regularization for Neural Machine Translation and Summarization

HSR-Enhanced Sparse Attention Acceleration

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention

Efficient Sparse Attention needs Adaptive Token Release

Loki: Low-Rank Keys for Efficient Sparse Attention

Anchor Attention, Small Cache: Code Generation with Large Language Models

SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs

Activation Sparsity Opportunities for Compressing General Large Language Models

Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Fast Quantum Algorithm for Attention Computation

On the Expressive Power of Self-Attention Matrices