Abstract: Convolutional Neural Networks (CNNs) are among the most prevalent deep learning techniques employed across various domains. Convolution operations account for most of the computational cost of CNNs and therefore largely determine overall model performance. Traditional CNN implementations convert convolutions into matrix operations via the im2col (image to column) technique, which enables parallelization through optimized BLAS libraries. This study identifies and investigates a significant yet intricate pattern of data redundancy within the matrix-based representation of convolutions, one that presents opportunities for optimization. By analyzing the redundancy inherent in the im2col approach, this paper introduces a mathematically succinct matrix representation for convolution, which leads to an optimized FFT-based convolution with finer FFT granularity. Benchmarking demonstrates that our approach achieves an average speedup of 14 times and a maximum speedup of 17 times over regular FFT convolution. It also outperforms the im2col+GEMM approach from NVIDIA's cuDNN library, with an average speedup of three times and a maximum speedup of five times. When integrated into Caffe, a widely used deep learning framework, our fine-grained FFT convolution approach yields significant performance gains: evaluations using synthetic CNNs designed for real-world applications show an average speedup of 1.67 times, and a modified VGG network variant achieves a speedup of 1.25 times.
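To make the two baselines named in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a single-channel 2-D input, stride 1, and no padding. The function names (`im2col`, `conv2d_im2col`, `conv2d_fft`, `conv2d_direct`) are hypothetical and chosen for illustration only. The FFT routine is the regular whole-image FFT convolution, not the fine-grained variant the paper proposes. The last lines show the data redundancy of the im2col matrix as the ratio of its size to the number of distinct input values.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh x kw patch of a 2-D array into one column (stride 1, no padding)."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_im2col(x, k):
    """CNN-style convolution (cross-correlation) as one matrix product."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (k.ravel() @ im2col(x, kh, kw)).reshape(oh, ow)

def conv2d_fft(x, k):
    """Same result via a regular (whole-image) FFT convolution."""
    h, w = x.shape
    kh, kw = k.shape
    fh, fw = h + kh - 1, w + kw - 1          # size of the full linear convolution
    X = np.fft.rfft2(x, s=(fh, fw))
    K = np.fft.rfft2(k[::-1, ::-1], s=(fh, fw))  # flipped kernel gives cross-correlation
    full = np.fft.irfft2(X * K, s=(fh, fw))
    return full[kh - 1:h, kw - 1:w]          # crop to the "valid" region

def conv2d_direct(x, k):
    """Reference implementation with explicit loops."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))   # toy input, sizes are arbitrary
k = rng.standard_normal((3, 3))   # toy kernel

cols = im2col(x, 3, 3)
# Each input pixel is copied into several columns: this is the redundancy.
redundancy = cols.size / x.size
print("im2col matrix shape:", cols.shape, "redundancy factor:", redundancy)
print("im2col matches direct:", np.allclose(conv2d_im2col(x, k), conv2d_direct(x, k)))
print("FFT matches direct:", np.allclose(conv2d_fft(x, k), conv2d_direct(x, k)))
```

In this toy case the 6 x 6 input has 36 values while the im2col matrix holds 144, so each input value is stored four times on average; the factor approaches the kernel area as the input grows.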

DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos

Deep Neural Network Acceleration with Sparse Prediction Layers

SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity

SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training

Fast CNN Inference by Adaptive Sparse Matrix Decomposition

A GPU-based high-performance optimization method of sparse convolutional neural networks

ResMap: Exploiting Sparse Residual Feature Map for Accelerating Cross-Edge Video Analytics

Efficient Convolutional Neural Networks Utilizing Fine-Grained Fast Fourier Transforms

Adaptive Pixel-wise Structured Sparse Network for Efficient CNNs

Recurrent Residual Module for Fast Inference in Videos

Accelerating convolutional neural network by exploiting sparsity on GPUs

UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

A Computing Efficient Hardware Architecture for Sparse Deep Neural Network Computing

Sparse Kronecker Canonical Polyadic Decomposition for Convolutional Neural Networks Compression

VSCNN: Convolution Neural Network Accelerator With Vector Sparsity

Fast CNN Pruning via Redundancy-Aware Training

Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-Local Spatial-Temporal Similarity

Sparsity Invariant CNNs

A 65-Nm Energy-Efficient Interframe Data Reuse Neural Network Accelerator for Video Applications

Don't Think It Twice: Exploit Shift Invariance for Efficient Online Streaming Inference of CNNs