Abstract: In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To close this gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including BERT-Mini, DistilBERT, BERT-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's DeepSparse under the same configurations on Xeon on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.
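
To make the idea of SpMM over weights pruned with a constant block size concrete, here is a minimal pure-Python sketch. It is not the paper's kernel, which targets Intel Deep Learning Boost instructions. The function names (`to_block_sparse`, `spmm`, `dense_matmul`), the dictionary-of-blocks storage layout, and the 2x2 block shape in the usage example are all assumptions made for illustration. The point it shows is that, when zeros are grouped into fixed-size blocks, the multiplication only needs to visit the blocks that survived pruning.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
BlockMap = Dict[Tuple[int, int], Matrix]


def to_block_sparse(w: Matrix, bh: int, bw: int) -> BlockMap:
    """Store only the nonzero bh x bw blocks of w, keyed by (block_row, block_col).

    Hypothetical storage layout chosen for readability, not performance.
    """
    rows, cols = len(w), len(w[0])
    assert rows % bh == 0 and cols % bw == 0, "shape must be a multiple of the block"
    blocks: BlockMap = {}
    for br in range(rows // bh):
        for bc in range(cols // bw):
            blk = [w[br * bh + i][bc * bw:(bc + 1) * bw] for i in range(bh)]
            if any(v != 0 for row in blk for v in row):
                blocks[(br, bc)] = blk
    return blocks


def spmm(blocks: BlockMap, bh: int, bw: int, rows: int, x: Matrix) -> Matrix:
    """Compute W @ X where W is given as its nonzero blocks; pruned blocks are skipped."""
    n = len(x[0])
    out = [[0.0] * n for _ in range(rows)]
    for (br, bc), blk in blocks.items():
        for i in range(bh):
            out_row = out[br * bh + i]
            for k in range(bw):
                wv = blk[i][k]
                x_row = x[bc * bw + k]
                for j in range(n):
                    out_row[j] += wv * x_row[j]
    return out


def dense_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference dense product used to check the sparse result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


if __name__ == "__main__":
    # 4x4 weight matrix in which two of the four 2x2 blocks are entirely zero.
    W = [[1, 2, 0, 0],
         [3, 4, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 5, 6]]
    X = [[1, 0], [0, 1], [1, 1], [2, 0]]
    blocks = to_block_sparse(W, 2, 2)
    print(len(blocks), "of 4 blocks stored")
    print(spmm(blocks, 2, 2, len(W), X) == dense_matmul(W, X))
```

In this sketch the work done by `spmm` scales with the number of stored blocks rather than with the full matrix size, which is the property a structured-sparsity kernel exploits; a production kernel would additionally vectorize the inner loops.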

FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

FastTuning: Enabling Fast and Efficient Hyper-Parameter Tuning with Partitioning and Parallelism of Search Space

Parallel Training of Pre-Trained Models Via Chunk-Based Dynamic Memory Management

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

PetS: A Unified Framework for Parameter-Efficient Transformers Serving

Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models

DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

Efficient Federated Prompt Tuning for Black-box Large Pre-trained Models

PETPS: Supporting Huge Embedding Models with Persistent Memory

Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent

A Multi-Level Framework for Accelerating Training Transformer Models

PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services

FedPT: Federated Proxy-Tuning of Large Language Models on Resource-Constrained Edge Devices

SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification