Exponentially Faster Language Modelling

Peter Belcak, Roger Wattenhofer
2023-11-21
Abstract: Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.
Computation and Language, Artificial Intelligence, Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper asks how to sharply reduce the number of neurons a language model uses during inference without significantly degrading performance, and thereby make inference exponentially faster. The authors propose UltraFastBERT, which replaces conventional feedforward networks (FFs) with fast feedforward networks (FFFs) and runs inference using only a very small fraction of its neurons.

### Main Problems and Solutions

1. **Problems**:
   - Conventional large language models such as BERT evaluate a large number of neurons on every inference, which wastes compute and slows inference.
   - Although these models have many parameters, not all neurons need to take part in any single inference.
2. **Solutions**:
   - A new model architecture, UltraFastBERT, uses fast feedforward networks (FFFs) instead of conventional feedforward networks (FFs) in its intermediate layers.
   - FFFs use conditional matrix multiplication (CMM) to engage only a small number of neurons per inference, which significantly reduces the amount of computation.
   - UltraFastBERT uses only 0.3% of its neurons during inference (12 out of 4095 per layer) while performing on par with similar BERT models.

### Key Technical Points

- **Fast Feedforward Networks (FFFs)**:
  - FFFs organize neurons into a balanced binary tree and, for each inference, evaluate only the neurons along a single root-to-leaf path.
  - This reduces the time complexity of the forward pass from O(n) to O(log₂ n), where n is the number of neurons.
- **Conditional Matrix Multiplication (CMM)**:
  - CMM is a form of matrix multiplication that selects which neurons to evaluate based on the input.
  - Each row of the input is dotted with the columns of the weight matrix one at a time, and the next weight column to use is chosen according to the result of the previous dot product. This is what makes the computation sparse.
- **Implementation and Acceleration**:
  - The paper provides CPU and GPU implementations and reports the speedups obtained at different levels of optimization.
  - The high-level CPU code achieves up to a 78x speedup over the optimized baseline feedforward implementation, and the PyTorch implementation achieves a 40x speedup over the equivalent batched feedforward inference.

### Summary

The main contribution of the paper is to demonstrate, through UltraFastBERT, that a language model can maintain its performance while engaging only a small fraction of its neurons during inference. This points to new directions for more efficient deep learning models and hardware accelerators.
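
The tree-path evaluation described under Key Technical Points can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the names `make_fff`, `gelu`, and `fff_forward`, the breadth-first array layout of the tree, the GELU activation, the random weights, and the small hidden width are all assumptions made for the example. Only the tree size (4095 neurons, 12 visited per inference) comes from the abstract.

```python
import math
import random


def make_fff(depth, hidden, seed=0):
    """Random weights for a balanced binary tree with `depth` levels.

    The tree has 2**depth - 1 neurons, stored breadth-first (assumed layout):
    the children of node i are nodes 2*i + 1 and 2*i + 2.
    """
    rng = random.Random(seed)
    n_nodes = 2 ** depth - 1
    w_in = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_nodes)]
    w_out = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_nodes)]
    return w_in, w_out


def gelu(x):
    """Exact GELU; the choice of activation is an assumption of this sketch."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def fff_forward(x, w_in, w_out, depth):
    """Evaluate one input vector along a single root-to-leaf path."""
    out = [0.0] * len(x)
    visited = []
    node = 0  # start at the root
    for _ in range(depth):
        # One dot product per level: this is the only neuron evaluated here.
        logit = sum(xi * wi for xi, wi in zip(x, w_in[node]))
        act = gelu(logit)
        for j in range(len(out)):
            out[j] += act * w_out[node][j]
        visited.append(node)
        # The sign of the logit decides which child to visit next.
        node = 2 * node + 1 + (1 if logit > 0 else 0)
    return out, visited


depth, hidden = 12, 8
w_in, w_out = make_fff(depth, hidden)
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(hidden)]

out, visited = fff_forward(x, w_in, w_out, depth)
n_nodes = 2 ** depth - 1
print(len(visited), n_nodes)            # 12 4095
print(f"{len(visited) / n_nodes:.2%}")  # 0.29%
```

With 12 levels the tree holds 2^12 - 1 = 4095 neurons, and each inference touches exactly one neuron per level. That gives the 12 out of 4095 (about 0.3%) quoted in the abstract, and the logarithmic cost in the number of neurons.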