Abstract: Since the breakthrough performance of AlexNet in 2012, convolutional neural networks (convnets) have grown into extremely powerful vision models. Deep learning researchers have used convnets to perform vision tasks with accuracy that was unachievable a decade ago. Confronted with the immense computation that convnets use, deep learning researchers also became interested in efficiency. However, the engineers who deployed efficient convnets soon realized that they were slower than the previous generation, despite using fewer operations. Many reverted to older models that ran faster. Hence researchers switched the objective of their search from arithmetic complexity to latency and produced a new wave of models that performed better. Paradoxically, these models also used more operations. Skepticism grew among researchers and engineers alike about the relevance of arithmetic complexity. Contrary to the prevailing view that latency and arithmetic complexity are irreconcilable, a simple formula relates both through computational efficiency. This insight enabled us to co-optimize the separate factors that determine latency. We observed that the degenerate conv2d layers that produce the best accuracy-complexity trade-off also use significant memory resources and have low computational efficiency. We devised block fusion algorithms to implement all the layers of a residual block in a single kernel, thereby creating temporal locality, avoiding communication, and reducing workspace size. Our ConvFirst model with block-fusion kernels has less arithmetic complexity and greater computational efficiency than baseline models and kernels, and ran approximately four times as fast as ConvNeXt. We also created novel tools, including efficiency gap plots and waterline analysis. Our unified approach to convnet efficiency envisions a new era of models and kernels that achieve greater accuracy at lower cost.

What problem does this paper attempt to address?

The problem this paper addresses is how to improve the efficiency of convolutional neural networks (convnets): reducing the amount of computation while maintaining or improving accuracy, and making sure that the reduction actually translates into lower latency. Specifically, the paper explores how to narrow the gap between ideal latency and actual latency, the so-called "efficiency gap", by optimizing models and the kernels that execute them together. The author points out that although second-generation convnets achieve higher model efficiency by reducing arithmetic complexity, they perform poorly in deployment because of low computational efficiency, and can even run more slowly than earlier models. The paper therefore proposes optimizing model efficiency and computational efficiency at the same time by introducing block-fusion kernels, which significantly reduce latency while maintaining high accuracy.

### Main Contributions

1. **Proposing the Efficiency Gap Plot**: A new visual tool that displays model efficiency, arithmetic complexity, latency, and computational efficiency together, making it easier to see how each factor affects performance.
2. **Developing the Waterline Analysis**: An extension of the traditional Roofline model that predicts the performance of a sequence of parallel kernels and shows how computational intensity limits performance.
3. **Designing the Block-Fusion Algorithm**: By merging all layers of a residual block into one kernel, it reduces memory access and communication overhead and improves computational efficiency.
4. **Implementing the ConvFirst Model**: Combining the above techniques with CUDA kernels yields a new convnet that runs about four times as fast as ConvNeXt.

### Key Concepts

Here $n$ denotes arithmetic complexity (the number of operations) and $R_i$ denotes peak arithmetic throughput.

- **Model Efficiency** ($E_m(n)$): Measures the accuracy of the model under a given amount of computation.
- **Ideal Latency** ($t_i(n)=\frac{n}{R_i}$): The shortest response time that the model can theoretically achieve under a given amount of computation.
- **Actual Latency** ($t_a(n)$): The response time of the model in the actual hardware and software environment.
- **Computational Efficiency** ($C(n)=\frac{R_a(n)}{R_i}$): The ratio of the actual arithmetic throughput to the peak arithmetic throughput.
- **Efficiency Gap**: The difference between ideal latency and actual latency, reflecting the level of computational efficiency.

### Experimental Results

The paper evaluates the ConvFirst model on the ImageNet-1K classification task. The results show that ConvFirst is about four times as fast as ConvNeXt at the same accuracy, demonstrating that block fusion is effective at improving computational efficiency.

In conclusion, the paper addresses the efficiency problem that convnets face in practical deployment and offers new ideas for the design of future efficient models.
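The definitions under Key Concepts can be tied together in a few lines of arithmetic. The sketch below assumes that actual arithmetic throughput is $R_a(n) = n / t_a(n)$, which is not stated in the summary but makes the definitions combine into $t_a(n) = t_i(n) / C(n)$. All function names and numbers are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
import math


def ideal_latency(n_ops: float, peak_ops_per_s: float) -> float:
    """t_i(n) = n / R_i: latency if the hardware ran at peak throughput."""
    return n_ops / peak_ops_per_s


def computational_efficiency(n_ops: float, actual_latency_s: float,
                             peak_ops_per_s: float) -> float:
    """C(n) = R_a(n) / R_i, assuming R_a(n) = n / t_a(n)."""
    actual_ops_per_s = n_ops / actual_latency_s
    return actual_ops_per_s / peak_ops_per_s


def actual_latency(n_ops: float, efficiency: float,
                   peak_ops_per_s: float) -> float:
    """t_a(n) = t_i(n) / C(n), obtained by combining the two definitions."""
    return ideal_latency(n_ops, peak_ops_per_s) / efficiency


# Hypothetical device: peak throughput of 1e13 operations per second.
R_I = 1.0e13

# Hypothetical "efficient" model: few operations, low computational efficiency.
n_small = 4.0e9
t_a_small = 2.0e-3  # assumed measured latency in seconds
t_i_small = ideal_latency(n_small, R_I)
c_small = computational_efficiency(n_small, t_a_small, R_I)
gap_small = t_a_small - t_i_small  # efficiency gap as a latency difference

# Hypothetical older model: twice the operations, much higher efficiency.
n_large = 8.0e9
c_large = 0.8
t_a_large = actual_latency(n_large, c_large, R_I)

print(f"small model: t_i={t_i_small:.1e} s, C={c_small:.2f}, gap={gap_small:.1e} s")
print(f"large model: t_a={t_a_large:.1e} s")
```

With these made-up numbers the model with half the operations is still twice as slow, because its computational efficiency is four times lower. This is the kind of situation the abstract describes, where fewer operations did not produce lower latency.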

On the Efficiency of Convolutional Neural Networks

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

Structured Convolutions for Efficient Neural Network Design

A Survey on Efficient Convolutional Neural Networks and Hardware Acceleration

Comb Convolution for Efficient Convolutional Architecture

Effnet: An Efficient Structure for Convolutional Neural Networks

Optimizing Convolutional Neural Networks on Multi-Core Vector Accelerator

InceptionNeXt: When Inception Meets ConvNeXt

A High Efficient Architecture for Convolution Neural Network Accelerator

Efficient Convolutional Neural Networks Utilizing Fine-Grained Fast Fourier Transforms

ON-CNN: Low Latency and High Throughput Online Arithmetic-Based Convolutional Neural Network Accelerator

Improving Efficiency in Convolutional Neural Network with Multilinear Filters

Layer-Wise Training To Create Efficient Convolutional Neural Networks

A Precision-Scalable Energy-Efficient Convolutional Neural Network Accelerator

LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones

Designing efficient convolutional neural network structure: A survey

Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

A Reconfigurable Spatial Architecture for Energy-Efficient Inception Neural Networks

Efficient Convolution Architectures for Convolutional Neural Network

Efficient Hardware Architectures for Deep Convolutional Neural Network