Abstract: In recent years, deep neural networks (DNNs) have become more powerful and more popular, especially in mobile computing. Applications running on edge AI devices such as smartphones could benefit from the new opportunities enabled by deep learning techniques. However, DNNs are by nature computationally and memory intensive, making them challenging to deploy on mobile devices. Binary neural networks (BNNs) are considered a promising solution that can significantly reduce the memory and computational requirements of DNNs while still offering capabilities similar to those of full-precision DNN models. Existing GPU-accelerated implementations of BNNs are tailored only for desktop platforms. Due to architectural differences, merely porting such implementations to mobile devices yields suboptimal performance or is impossible in some cases. GPU-accelerated implementations of BNNs for mobile devices have therefore been missing from the literature. In this paper, we propose PhoneBit, a GPU-accelerated BNN inference engine for mobile devices that fully exploits the computing power of BNNs on mobile GPUs. PhoneBit provides a set of operator-level optimizations, including a locality-friendly data layout, bit packing with vectorization, and layer integration for efficient binary convolution. We also provide a detailed implementation and parallelization optimization for PhoneBit to make the best use of the memory bandwidth and computing power of mobile GPUs. Our experimental results show that PhoneBit can achieve significant speedup and energy efficiency compared with state-of-the-art frameworks for mobile devices. The PhoneBit open source library is available for download at <a href="https://code.ihub.org.cn/projects/915/repository/PhoneBit">https://code.ihub.org.cn/projects/915/repository/PhoneBit</a>.
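
To make the bit-packing idea concrete, the following is a minimal sketch of how a dot product between two vectors of +1/-1 values can be computed once the values are packed one bit per element. It uses the XNOR-and-popcount formulation that is common in BNN work; the abstract does not state that PhoneBit uses exactly this formulation, so treat it as an assumption. The function names `pack_bits` and `binary_dot` are hypothetical and are not taken from the PhoneBit library, and the sketch runs on the CPU in plain Python rather than on a mobile GPU.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into one integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where the two vectors agree. Each agreement
    contributes +1 and each disagreement contributes -1, so the result is
    matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1                  # keep only the n valid bits
    xnor = ~(a_word ^ b_word) & mask     # 1 where the bits are equal
    matches = bin(xnor).count("1")       # popcount
    return 2 * matches - n


if __name__ == "__main__":
    a = [1, -1, 1, 1, -1, -1, 1, -1]
    b = [1, 1, -1, 1, -1, 1, 1, -1]
    packed = binary_dot(pack_bits(a), pack_bits(b), len(a))
    reference = sum(x * y for x, y in zip(a, b))
    print(packed, reference)
```

A binary convolution applies the same operation to every filter window, which is why packing many elements into one machine word and processing several words per vector instruction can reduce both memory traffic and arithmetic cost.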

1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs

BitNet a4.8: 4-bit Activations for 1-bit LLMs

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

BitNet: Scaling 1-bit Transformers for Large Language Models

SparseByteNN: A Novel Mobile Inference Acceleration Framework Based on Fine-Grained Group Sparsity

BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks

Inference Performance Optimization for Large Language Models on CPUs

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Arbitrary Bit-width Network: A Joint Layer-Wise Quantization and Adaptive Inference Approach

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

Inference Acceleration for Large Language Models on CPUs

Distributed Inference Performance Optimization for LLMs on CPUs

PhoneBit: Efficient GPU-Accelerated Binary Neural Network Inference Engine for Mobile Phones

TCP-Net: Minimizing Operation Counts of Binarized Neural Network Inference

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Bit-Quantized-Net: An Effective Method for Compressing Deep Neural Networks

An efficient GPU-accelerated inference engine for binary neural network on mobile phones

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy