LookupFFN: Making Transformers Compute-lite for CPU inference

Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan Sankaralingam, Vikas Singh

2024-03-12

Abstract: While GPU clusters are the de facto choice for training large deep neural network (DNN) models today, several reasons including ease of workflow, security and cost have led to efforts investigating whether CPUs may be viable for inference in routine use in many sectors of the industry. But the imbalance between the compute capabilities of GPUs and CPUs is huge. Motivated by these considerations, we study a module which is a workhorse within modern DNN architectures, GEMM based Feed Forward Networks (FFNs), and assess the extent to which it can be made compute- (or FLOP-) lite. Specifically, we propose an alternative formulation (we call it LookupFFN) to GEMM based FFNs inspired by the recent studies of using Locality Sensitive Hashing (LSH) to approximate FFNs. Our formulation recasts most essential operations as a memory look-up, leveraging the trade-off between the two resources on any platform: compute and memory (since CPUs offer it in abundance). For RoBERTa language model pretraining, our formulation achieves similar performance compared to GEMM based FFNs, while dramatically reducing the required FLOP. Our development is complemented with a detailed hardware profiling of strategies that will maximize efficiency -- not just on contemporary hardware but on products that will be offered in the near/medium term future. Code is available at https://github.com/mlpen/LookupFFN.

Machine Learning

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily addresses the computational bottleneck encountered when performing large-scale deep neural network (DNN) inference on CPUs. Specifically:

1. **Imbalance between CPU and GPU computational capabilities**:
   - GPU clusters are currently the de facto standard for training large-scale DNN models. Because of workflow simplicity, security, and cost, researchers are exploring whether CPUs can be used for routine inference tasks.
   - CPUs offer far less compute than GPUs, so making CPU inference practical requires reducing the amount of computation the model needs.
2. **Compute cost of GEMM-based feed-forward networks (FFNs)**:
   - FFNs are crucial components in modern DNN architectures, especially in Transformer models. They rely on general matrix multiplication (GEMM), which is particularly resource-intensive in large-scale models.
   - To reduce the required floating-point operations (FLOPs), the paper proposes an alternative, LookupFFN, inspired by locality-sensitive hashing (LSH), which recasts the key operations as memory lookups.
3. **Limitations of existing methods**:
   - The paper analyzes existing methods that use LSH to approximate FFNs (such as SLIDE and MONGOOSE) and identifies two issues: they need a large number of hash functions to achieve accurate results, and they must continuously update hash tables during training, which adds significant computational overhead.
   - The proposed method aims to avoid rehashing entirely by learning the hash functions and hash tables end-to-end, thereby reducing the computational burden.

In summary, the paper aims to improve inference performance on CPUs and reduce computational demands by proposing a new FFN implementation, LookupFFN.
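To make the "memory lookup instead of GEMM" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes sign-random-projection LSH with fixed random hyperplanes, whereas the summary above says LookupFFN learns its hash functions and tables end-to-end. All function names, table sizes, and the FLOP-counting convention (2 FLOPs per multiply-add) are hypothetical choices made for illustration.

```python
import random


def make_lookup_ffn(d_in, d_out, num_tables, bits_per_table, seed=0):
    """Build hyperplanes and tables. Both are random here (an assumption);
    in a trained model they would be learned parameters."""
    rng = random.Random(seed)
    # For each table: `bits_per_table` hyperplanes, each a vector of size d_in.
    planes = [[[rng.gauss(0.0, 1.0) for _ in range(d_in)]
               for _ in range(bits_per_table)]
              for _ in range(num_tables)]
    # For each table: 2**bits_per_table rows, each a vector of size d_out.
    tables = [[[rng.gauss(0.0, 0.1) for _ in range(d_out)]
               for _ in range(2 ** bits_per_table)]
              for _ in range(num_tables)]
    return planes, tables


def hash_code(x, table_planes):
    """Sign-random-projection LSH: one bit per hyperplane, packed into an int."""
    code = 0
    for plane in table_planes:
        dot = sum(p * v for p, v in zip(plane, x))
        code = (code << 1) | (1 if dot >= 0.0 else 0)
    return code


def lookup_ffn(x, planes, tables):
    """Fetch one row per table by hash code and sum the rows."""
    d_out = len(tables[0][0])
    out = [0.0] * d_out
    for table_planes, table in zip(planes, tables):
        row = table[hash_code(x, table_planes)]  # memory lookup, no GEMM
        for i in range(d_out):
            out[i] += row[i]
    return out


def dense_ffn_flops(d_model, d_hidden):
    """Per-token FLOPs of a two-layer GEMM FFN (2 FLOPs per multiply-add)."""
    return 2 * d_model * d_hidden + 2 * d_hidden * d_model


def lookup_ffn_flops(d_model, num_tables, bits_per_table):
    """Per-token FLOPs of the sketch above: dense hashing projections
    plus one vector addition per table."""
    return 2 * d_model * num_tables * bits_per_table + d_model * num_tables


if __name__ == "__main__":
    planes, tables = make_lookup_ffn(d_in=8, d_out=4, num_tables=3, bits_per_table=5)
    print(lookup_ffn([0.5] * 8, planes, tables))
    # Illustrative sizes only; these are not measurements from the paper.
    print(dense_ffn_flops(768, 3072), lookup_ffn_flops(768, 16, 8))
```

In this sketch the arithmetic that remains is the hashing projection plus one vector addition per table. The table rows are fetched from memory, which is the compute-for-memory trade-off the summary describes. The FLOP helpers only count operations under the stated convention and say nothing about accuracy or wall-clock speed.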

Inference Performance Optimization for Large Language Models on CPUs

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks

LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup

Utilizing cloud FPGAs towards the open neural network standard

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization

Efficient LLM inference solution on Intel GPU

Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation

HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs

Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic

FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing