Abstract: As machine learning gets deployed more and more widely, and model sizes continue to grow, improving computational efficiency during model inference has become a key challenge. In many commonly used model architectures, including Transformers, a significant portion of the inference computation is comprised of exponential non-linearities such as Softmax. In this work, we develop QuAKE, a collection of novel operators that leverage certain properties of IEEE-754 floating point representations to quickly approximate the exponential function without requiring specialized hardware, extra memory, or precomputation. We propose optimizations that enhance the efficiency of QuAKE in commonly used exponential non-linearities such as Softmax, GELU, and the Logistic function. Our benchmarks demonstrate substantial inference speed improvements between 10% and 35% on server CPUs, and 5% and 45% on embedded and mobile-scale CPUs for a variety of model architectures and sizes. Evaluations of model performance on standard datasets and tasks from various domains show that QuAKE operators are able to provide sizable speed benefits with little to no loss of performance on downstream tasks.

What problem does this paper attempt to address?

The paper aims to improve the computational efficiency of machine-learning model inference, particularly for computations involving exponential non-linearities such as the Softmax, GELU, and Logistic functions. As models grow, the compute consumed during inference has become a key challenge. To address it, the authors propose QuAKE (Quick and Approximate Kernels for Exponential Non-Linearities), a set of new operators that exploit properties of the IEEE-754 floating-point representation to approximate the exponential function quickly. The reported speedups are 10% to 35% on server CPUs and 5% to 45% on embedded and mobile-scale CPUs, without dedicated hardware, extra memory, or precomputation.

Specifically, the paper focuses on the following aspects:

1. **Improving inference speed**: Reduce inference time by optimizing the computation of exponential non-linearities, especially on server CPUs and embedded/mobile devices.
2. **Maintaining model performance**: Ensure that model performance on downstream tasks does not decline significantly when the QuAKE operators are used.
3. **Wide applicability**: QuAKE is not tied to one type of model; it can be used across neural network architectures, including Transformers and Convolutional Neural Networks (CNNs).

### Main contributions

1. **Design of QuAKE operators**:
   - A method for quickly approximating the exponential function is proposed, based on properties of the IEEE-754 floating-point representation.
   - An affine transformation and a quadratic correction further improve the accuracy and efficiency of the approximation.
2. **Extensive experimental verification**:
   - Benchmarks were run on multiple hardware platforms, showing the speedup QuAKE achieves on each.
   - Model performance was evaluated on multiple standard datasets, showing that QuAKE provides a significant speed boost while maintaining model performance.
3. **Combination of theory and practice**:
   - The mathematical principles behind QuAKE are analyzed in detail, and its effectiveness in practical applications is demonstrated through experiments.
   - QuAKE2, an improved version of QuAKE, is proposed; it further improves speed while maintaining high precision.

### Formula presentation

The key formulas involved in the paper are as follows:

- **Floating-point representation**:
  \[ |x| = 2^{x_e}(1 + x_m) \]
  where \( x_e \) is the exponent part and \( x_m \) is the mantissa part.
- **Exponential approximation**:
  \[ 2^x \approx 2^{\lfloor x \rfloor} (1 + \{ x \}) \]
  where \( \lfloor x \rfloor \) is the integer part of \( x \) and \( \{ x \} \) is the fractional part of \( x \).
- **General form of QuAKE**:
  \[ z = c_0 x + c_1, \quad c_0 = 2^{l_m} p, \quad c_1 = 2^{l_m} (B + q) \]
- **Approximation of the GELU activation function**:
  \[ \text{GELU}(x) \approx 0.5x \left( 1 + \tanh\left[ \sqrt{\frac{2}{\pi}} \left( x + 0.044715 x^3 \right) \right] \right) \]

With these methods, QuAKE improves inference speed while maintaining model performance, which makes it well suited to deploying large-scale machine-learning models.
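The exponential approximation and the affine form \( z = c_0 x + c_1 \) above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. The summary does not define \( l_m \), \( B \), \( p \), or \( q \), so the sketch assumes that \( l_m \) is the number of mantissa bits (23 for IEEE-754 binary32), \( B \) is the exponent bias (127), \( p = 1/\ln 2 \) converts \( e^x \) to a power of two, and \( q \) is a small tunable shift (set to 0 here). The names `fast_exp` and `fast_softmax` are hypothetical, and the quadratic correction mentioned above is not reproduced.

```python
import math
import struct

MANTISSA_BITS = 23    # assumed meaning of l_m (IEEE-754 binary32)
EXPONENT_BIAS = 127   # assumed meaning of B (IEEE-754 binary32)
MAX_FINITE_BITS = 0x7F7FFFFF  # bit pattern of the largest finite float32


def fast_exp(x: float, p: float = 1.0 / math.log(2.0), q: float = 0.0) -> float:
    """Approximate e**x by writing an affine function of x into float32 bits.

    Placing (x * p + B + q) * 2**l_m in the bit pattern puts the integer part
    into the exponent field and the fractional part into the mantissa field,
    which yields 2**floor(y) * (1 + frac(y)) for y = x * p when q = 0.
    """
    c0 = (1 << MANTISSA_BITS) * p
    c1 = (1 << MANTISSA_BITS) * (EXPONENT_BIAS + q)
    z = c0 * x + c1
    # Clamp to the range of non-negative finite float32 bit patterns so that
    # very small inputs map to 0.0 and very large inputs stay finite.
    bits = int(min(max(z, 0.0), float(MAX_FINITE_BITS)))
    # Reinterpret the integer bit pattern as a float32 value.
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fast_softmax(logits: list[float]) -> list[float]:
    """Softmax that uses fast_exp; subtracting the max keeps inputs <= 0."""
    peak = max(logits)
    exps = [fast_exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 1.0, 3.0):
        print(value, fast_exp(value), math.exp(value))
    print(fast_softmax([1.0, 2.0, 3.0]))
```

With \( q = 0 \), this piecewise-linear approximation overestimates \( e^x \) by at most roughly 6% in relative terms; a nonzero \( q \) trades that one-sided error for a smaller two-sided one. In Softmax, part of the error cancels in the normalization because numerator and denominator are approximated in the same way.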

QuAKE: Speeding up Model Inference Using Quick and Approximate Kernels for Exponential Non-Linearities

PackQViT: Faster Sub-8-bit Vision Transformers Via Full and Packed Quantization on the Mobile.

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

KDEformer: Accelerating Transformers via Kernel Density Estimation

Efficient Execution of Quantized Deep Learning Models: A Compiler Approach

ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference

Improving Transformer Inference Through Optimized Non-Linear Operations with Quantization-Approximation-Based Strategy

DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

QuACK: Accelerating Gradient-Based Quantum Optimization with Koopman Operator Learning

A Speed Odyssey for Deployable Quantization of LLMs

Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance

QUACK: Quantum Aligned Centroid Kernel

TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs

KV Prediction for Improved Time to First Token

Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Sparks of Quantum Advantage and Rapid Retraining in Machine Learning

A Model for Circuit Execution Runtime And Its Implications for Quantum Kernels At Practical Data Set Sizes

Faster variational quantum algorithms with quantum kernel-based surrogate models

Accuracy and Performance of Functional Parameter Estimation Using a Novel Numerical Optimization Approach for GPU-Based Kinetic Compartmental Modeling

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM