Abstract: In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's Deepsparse under the same configurations on Xeon on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.
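
To make the idea of SpMM over weights pruned with a constant block size concrete, the following is a minimal pure-Python sketch. It is not the kernel described in the abstract: the 4 x 1 block shape, the list-of-blocks storage layout, and the function names `to_block_sparse`, `spmm`, and `dense_matmul` are all assumptions chosen for illustration, and no Intel Deep Learning Boost instructions are involved. It only shows that the product can be computed by visiting the stored nonzero blocks and skipping the pruned ones.

```python
# Illustrative only: block shape, storage layout, and names are assumptions,
# not the paper's actual kernel or data format.

def to_block_sparse(weight, bh, bw):
    """Keep only the nonzero bh x bw blocks of a dense row-major matrix."""
    rows, cols = len(weight), len(weight[0])
    assert rows % bh == 0 and cols % bw == 0, "shape must be a multiple of the block"
    blocks = []  # entries: (block_row, block_col, bh x bw tile of values)
    for br in range(rows // bh):
        for bc in range(cols // bw):
            tile = [weight[br * bh + i][bc * bw:(bc + 1) * bw] for i in range(bh)]
            if any(v != 0 for row in tile for v in row):
                blocks.append((br, bc, tile))
    return blocks, rows


def spmm(blocks, rows, bh, bw, dense):
    """Compute (sparse weight) @ dense, visiting only the stored blocks."""
    n = len(dense[0])
    out = [[0] * n for _ in range(rows)]
    for br, bc, tile in blocks:
        for i in range(bh):
            dst = out[br * bh + i]
            for k in range(bw):
                w = tile[i][k]
                src = dense[bc * bw + k]
                for j in range(n):
                    dst[j] += w * src[j]
    return out


def dense_matmul(a, b):
    """Reference dense product used to check the sparse result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# 4 x 4 weight pruned in 4 x 1 blocks: columns 1 and 3 are entirely zero.
W = [[1, 0, 2, 0],
     [3, 0, 4, 0],
     [5, 0, 6, 0],
     [7, 0, 8, 0]]
X = [[1, 2],
     [3, 4],
     [5, 6],
     [7, 8]]

blocks, rows = to_block_sparse(W, 4, 1)
Y = spmm(blocks, rows, 4, 1, X)
assert Y == dense_matmul(W, X)  # sparse and dense products agree
```

In this toy case only two of the four 4 x 1 blocks are stored, so the inner loops do half the multiply-accumulate work of the dense product. A real kernel would additionally vectorize the per-block work and use a compact index format rather than Python lists.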

SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning

SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization

LongCoder: A Long-Range Pre-trained Language Model for Code Completion

Understanding Long Programming Languages with Structure-Aware Sparse Attention

Tackling Long Code Search with Splitting, Encoding, and Aggregating

CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

Efficient Sparse Attention needs Adaptive Token Release

TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills

Learned Token Pruning for Transformers

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Generating Long Sequences with Sparse Transformers

Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models

SparseOptimizer: Sparsify Language Models through Moreau-Yosida Regularization and Accelerate via Compiler Co-design

TransformCode: A Contrastive Learning Framework for Code Embedding Via Subtree Transformation

Sparse Attention-Based Neural Networks for Code Classification

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs

ALPINE: An adaptive language-agnostic pruning method for language models for code