Abstract: Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.
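
The figures quoted in the abstract (12 of 4095 neurons, about 0.3%) are consistent with a complete binary tree in which only one root-to-leaf path is evaluated. The short check below works through that arithmetic. The function name `tree_neuron_counts` is hypothetical, and the depth of 11 is an assumption chosen because it reproduces the numbers in the abstract.

```python
def tree_neuron_counts(depth):
    """Return (total nodes, nodes on one root-to-leaf path) for a
    complete binary tree whose leaves sit `depth` edges below the root."""
    total = 2 ** (depth + 1) - 1   # every node of the tree holds one neuron
    on_path = depth + 1            # one neuron per level, root included
    return total, on_path


# Assumption: depth 11, which matches the 4095 and 12 in the abstract.
total, on_path = tree_neuron_counts(11)
fraction = on_path / total
print(total, on_path, f"{fraction:.2%}")  # prints: 4095 12 0.29%
```

The number of neurons on the path grows with the logarithm of the total, which is where the "exponential fraction" wording comes from.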

What problem does this paper attempt to address?

The paper asks how to sharply reduce the number of neurons a language model needs during inference without significantly degrading performance, thereby making inference exponentially faster. The authors propose UltraFastBERT, which replaces the traditional feedforward networks (FFs) with fast feedforward networks (FFFs) and performs inference using only a very small fraction of the neurons.

### Main Problems and Solutions

1. **Problems**:
   - Traditional large-scale language models such as BERT activate a large number of neurons during inference, which wastes compute and slows inference.
   - Although these models have a large number of parameters, not all neurons need to participate in any individual inference.
2. **Solutions**:
   - A new model architecture, UltraFastBERT, uses fast feedforward networks (FFFs) in place of the traditional feedforward networks (FFs) in its intermediate layers.
   - FFFs selectively activate a small number of neurons for each inference through conditional matrix multiplication (CMM), which significantly reduces the amount of computation.
   - UltraFastBERT uses only 0.3% of its neurons during inference (12 out of 4095 per layer) while performing on par with similar BERT models.

### Key Technical Points

- **Fast Feedforward Networks (FFFs)**:
  - FFFs organize neurons into a balanced binary tree and evaluate only the neurons on a single root-to-leaf path during each inference.
  - This reduces the cost of a forward pass from O(n) to O(log₂ n) neuron evaluations, where n is the number of neurons.
- **Conditional Matrix Multiplication (CMM)**:
  - CMM is a form of matrix multiplication that selects which neurons to evaluate based on the input.
  - It computes the dot products of the input with the weight matrix one at a time, using the result of each dot product to select which weights enter the next one. The computation is therefore sparse.
- **Implementation and Acceleration**:
  - The paper provides CPU and GPU implementations and reports the speedups obtained at different levels of optimization.
  - The high-level CPU code achieves a 78x speedup over the optimized baseline feedforward implementation, and the PyTorch implementation achieves a 40x speedup over the equivalent batched feedforward inference.

### Summary

The main contribution of the paper is to show that large-scale language models need to activate only a small fraction of their neurons during inference to maintain high performance, and to demonstrate with UltraFastBERT that this is feasible in practice. This suggests new directions for more efficient deep-learning models and hardware accelerators.
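
The tree-structured inference described under Key Technical Points can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The heap-style node indexing, the GELU activation, the use of the sign of each dot product to pick the next child, and the names `fff_forward`, `w_in`, and `w_out` are all assumptions introduced here.

```python
import math
import random


def gelu(x):
    """Exact GELU; the choice of activation is an assumption of this sketch."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fff_forward(x, w_in, w_out, depth):
    """Evaluate one root-to-leaf path of a tree of neurons.

    x     : input vector (list of floats)
    w_in  : one input-weight vector per node, indexed heap-style
    w_out : one output-weight vector per node, same indexing
    depth : number of edges from the root to a leaf
    Returns the output vector and the list of visited node indices.
    """
    d_out = len(w_out[0])
    y = [0.0] * d_out
    node = 0                      # start at the root
    visited = []
    for _ in range(depth + 1):    # one neuron per tree level
        visited.append(node)
        logit = dot(x, w_in[node])          # the only dot product at this level
        act = gelu(logit)
        for j in range(d_out):              # add this neuron's contribution
            y[j] += act * w_out[node][j]
        # Assumed routing rule: the sign of the logit picks the child.
        node = 2 * node + 1 + (1 if logit > 0 else 0)
    return y, visited


# Toy usage with random weights (illustrative sizes, not the paper's).
random.seed(0)
depth, d_in, d_out = 3, 8, 8
n_nodes = 2 ** (depth + 1) - 1
w_in = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n_nodes)]
w_out = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(n_nodes)]
x = [random.gauss(0, 1) for _ in range(d_in)]

y, visited = fff_forward(x, w_in, w_out, depth)
print(len(visited), "of", n_nodes, "neurons evaluated")
```

Only `depth + 1` dot products are computed per input, regardless of how many neurons the tree holds. A dense feedforward layer of the same size would compute one per neuron.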

Exponentially Faster Language Modelling
