On the Efficiency of Convolutional Neural Networks

Andrew Lavin
2024-05-22
Abstract: Since the breakthrough performance of AlexNet in 2012, convolutional neural networks (convnets) have grown into extremely powerful vision models. Deep learning researchers have used convnets to perform vision tasks with accuracy that was unachievable a decade ago. Confronted with the immense computation that convnets use, deep learning researchers also became interested in efficiency. However, the engineers who deployed efficient convnets soon realized that they were slower than the previous generation, despite using fewer operations. Many reverted to older models that ran faster. Hence researchers switched the objective of their search from arithmetic complexity to latency and produced a new wave of models that performed better. Paradoxically, these models also used more operations. Skepticism grew among researchers and engineers alike about the relevance of arithmetic complexity. Contrary to the prevailing view that latency and arithmetic complexity are irreconcilable, a simple formula relates both through computational efficiency. This insight enabled us to co-optimize the separate factors that determine latency. We observed that the degenerate conv2d layers that produce the best accuracy-complexity trade-off also use significant memory resources and have low computational efficiency. We devised block fusion algorithms to implement all the layers of a residual block in a single kernel, thereby creating temporal locality, avoiding communication, and reducing workspace size. Our ConvFirst model with block-fusion kernels has less arithmetic complexity and greater computational efficiency than baseline models and kernels, and ran approximately four times as fast as ConvNeXt. We also created novel tools, including efficiency gap plots and waterline analysis. Our unified approach to convnet efficiency envisions a new era of models and kernels that achieve greater accuracy at lower cost.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to improve the efficiency of convolutional neural networks (convnets): reducing the amount of computation while maintaining or improving accuracy, and making sure that the reduction actually translates into lower latency. Specifically, it explores how to narrow the gap between ideal latency and actual latency, the so-called "efficiency gap", by co-optimizing the model and the kernels that implement it. The author points out that although second-generation convnets achieve higher model efficiency by reducing arithmetic complexity, they perform poorly in deployment because of low computational efficiency, and can even run more slowly than earlier models. The paper therefore proposes optimizing model efficiency and computational efficiency together by introducing block fusion, which significantly reduces latency while maintaining high accuracy.

### Main Contributions

1. **Proposing the Efficiency Gap Plot**: A new visual tool that displays model efficiency, arithmetic complexity, latency, and computational efficiency together, helping to show how each factor affects performance.
2. **Developing the Waterline Analysis**: An extension of the traditional Roofline model that predicts the performance of a series of parallel kernels and reveals how computational intensity limits performance.
3. **Designing the Block-Fusion Algorithm**: By merging all layers of a residual block into one kernel, it reduces memory access and communication overhead and improves computational efficiency.
4. **Implementing the ConvFirst Model**: Combining the above techniques with CUDA kernels, the author develops a new convnet that runs about four times as fast as ConvNeXt at the same accuracy.

### Key Concepts

- **Model Efficiency** ($E_m(n)$): Measures the accuracy of the model under a given amount of computation.
- **Ideal Latency** ($t_i(n)=\frac{n}{R_i}$): The shortest response time that the model can theoretically achieve under a given amount of computation.
- **Actual Latency** ($t_a(n)$): The response time of the model in the actual hardware and software environment.
- **Computational Efficiency** ($C(n)=\frac{R_a(n)}{R_i}$): The ratio of the actual arithmetic throughput to the peak arithmetic throughput.
- **Efficiency Gap**: The difference between ideal latency and actual latency, reflecting the level of computational efficiency.

### Experimental Results

The paper evaluates the ConvFirst model on the ImageNet-1K classification task. The results show that ConvFirst is about four times as fast as ConvNeXt at the same accuracy, which demonstrates the effectiveness of block fusion in improving computational efficiency.

In conclusion, the paper addresses the efficiency problem of convolutional neural networks in practical deployment through a systematic methodology, and offers new ideas for the design of future efficient models.
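The latency definitions under Key Concepts can be combined into a single relation. As a minimal sketch, assume that the actual arithmetic throughput is the operation count divided by the actual latency, $R_a(n) = n / t_a(n)$ (an assumption made here, not a formula quoted from the paper). Then $t_a(n) = \frac{n}{C(n)\,R_i} = \frac{t_i(n)}{C(n)}$. The function names and all numbers below are hypothetical and chosen only to illustrate how a model with more operations can still be faster when its computational efficiency is higher.

```python
def ideal_latency(n_ops: float, peak_ops_per_s: float) -> float:
    """t_i(n) = n / R_i: latency if the device ran at peak throughput."""
    if peak_ops_per_s <= 0:
        raise ValueError("peak throughput must be positive")
    return n_ops / peak_ops_per_s


def computational_efficiency(
    n_ops: float, actual_latency_s: float, peak_ops_per_s: float
) -> float:
    """C(n) = R_a(n) / R_i, assuming R_a(n) = n / t_a(n)."""
    if actual_latency_s <= 0:
        raise ValueError("actual latency must be positive")
    actual_throughput = n_ops / actual_latency_s
    return actual_throughput / peak_ops_per_s


def actual_latency(n_ops: float, efficiency: float, peak_ops_per_s: float) -> float:
    """t_a(n) = t_i(n) / C(n), which follows from the two definitions above."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return ideal_latency(n_ops, peak_ops_per_s) / efficiency


if __name__ == "__main__":
    PEAK = 100e12  # hypothetical peak throughput, operations per second

    # Hypothetical models: (operation count, computational efficiency).
    models = {
        "fewer ops, low efficiency": (4e9, 0.10),
        "more ops, high efficiency": (8e9, 0.40),
    }
    for name, (n_ops, eff) in models.items():
        t_i = ideal_latency(n_ops, PEAK)
        t_a = actual_latency(n_ops, eff, PEAK)
        # The efficiency gap is the difference between actual and ideal latency.
        print(f"{name}: ideal {t_i * 1e3:.3f} ms, actual {t_a * 1e3:.3f} ms, "
              f"gap {(t_a - t_i) * 1e3:.3f} ms")
```

With these made-up numbers, the second model performs twice as many operations yet has the lower actual latency, which is the pattern the abstract describes for latency-optimized models.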