Parallel Blockwise Knowledge Distillation for Deep Neural Network Compression

Cody Blakeney, Xiaomin Li, Yan Yan, Ziliang Zong
DOI: https://doi.org/10.48550/arXiv.2012.03096
2020-12-06
Abstract: Deep neural networks (DNNs) have been extremely successful in solving many challenging AI tasks in natural language processing, speech recognition, and computer vision. However, DNNs are typically computation intensive, memory demanding, and power hungry, which significantly limits their usage on platforms with constrained resources. Therefore, a variety of compression techniques (e.g., quantization, pruning, and knowledge distillation) have been proposed to reduce the size and power consumption of DNNs. Blockwise knowledge distillation is one of the compression techniques that can effectively reduce the size of a highly complex DNN. However, it is not widely adopted due to its long training time. In this paper, we propose a novel parallel blockwise distillation algorithm to accelerate the distillation process of sophisticated DNNs. Our algorithm leverages local information to conduct independent blockwise distillation, utilizes depthwise separable layers as the efficient replacement block architecture, and properly addresses limiting factors (e.g., dependency, synchronization, and load balancing) that affect parallelism. The experimental results running on an AMD server with four GeForce RTX 2080Ti GPUs show that our algorithm can achieve 3x speedup plus 19% energy savings on VGG distillation, and 3.5x speedup plus 29% energy savings on ResNet distillation, both with negligible accuracy loss. The speedup of ResNet distillation can be further improved to 3.87x when using four RTX6000 GPUs in a distributed cluster.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the heavy computation, memory, and energy demands of deep neural networks (DNNs), which limit their use on resource-constrained platforms such as mobile and Internet-of-Things devices. Compression techniques such as quantization, pruning, and knowledge distillation have been proposed in response. Blockwise knowledge distillation can effectively reduce the size of a complex DNN, but it is not widely adopted because training takes a long time.

The paper proposes a parallel blockwise knowledge distillation algorithm to accelerate this process. It uses local information to distill each block independently, uses depthwise separable layers as the efficient replacement block architecture, and addresses the factors that limit parallelism (dependency, synchronization, and load balancing). The result is shorter training time and lower energy consumption with negligible accuracy loss.

### Main contributions of the paper

1. **Accelerated training**: Parallelizing blockwise knowledge distillation reduces training time. Experiments show 3x speedup on VGG distillation and 3.5x speedup on ResNet distillation.
2. **Energy savings**: Compared with traditional methods, the algorithm saves 19% of energy consumption on VGG and 29% on ResNet.
3. **Efficient parallelization**: Task-level parallelism minimizes the synchronization overhead between tasks running on multiple GPUs.
4. **Elasticity and scalability**: Users can scale to more GPUs without adjusting hyperparameters.
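To put the reported speedups in context, they can be converted to parallel efficiency (speedup divided by the number of GPUs). The sketch below uses only the figures quoted in the abstract; it assumes each speedup is measured against a single-GPU baseline, and the function name is illustrative.

```python
# Parallel efficiency = speedup / number of workers.
# Speedups are the figures quoted in the abstract; the single-GPU
# baseline is an assumption made for this illustration.
NUM_GPUS = 4

reported_speedups = {
    "VGG, 4x RTX 2080Ti": 3.0,
    "ResNet, 4x RTX 2080Ti": 3.5,
    "ResNet, 4x RTX6000 (cluster)": 3.87,
}


def parallel_efficiency(speedup: float, workers: int) -> float:
    """Return the fraction of ideal linear speedup that was achieved."""
    return speedup / workers


for setup, speedup in reported_speedups.items():
    print(f"{setup}: {parallel_efficiency(speedup, NUM_GPUS):.2%}")
```

Under that assumption, the three configurations reach roughly 75%, 87.5%, and 96.75% of ideal linear scaling on four GPUs.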
### Technical details

- **Independent blockwise distillation**: Each block's training task is independent of the others, which removes inter-block dependencies and lets multiple blocks train simultaneously.
- **Depthwise separable layers**: Used as the replacement block architecture, they significantly reduce computation and memory usage.
- **Parallel scheduling algorithms**: Strategies such as round-robin, bin-packing, and work-stealing optimize task allocation and load balancing.

Together, these improvements raise training efficiency and make blockwise distillation a more practical option for compressing deep neural networks.
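Two of the ideas above can be illustrated with a small, self-contained sketch. The first part counts weights in a standard convolution versus a depthwise separable one, using the standard formulas. The second part compares round-robin assignment with a greedy longest-task-first heuristic, one common way to approach bin-packing-style load balancing. The task costs are hypothetical, all function names are illustrative, and the paper's actual schedulers (including work-stealing, which is dynamic and not shown here) may differ.

```python
import heapq


def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """k x k depthwise conv (one filter per input channel) plus a 1x1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def round_robin(costs: list[int], num_gpus: int) -> list[int]:
    """Assign task i to GPU i mod num_gpus, ignoring task cost."""
    loads = [0] * num_gpus
    for i, cost in enumerate(costs):
        loads[i % num_gpus] += cost
    return loads


def longest_first(costs: list[int], num_gpus: int) -> list[int]:
    """Greedy heuristic: give the most expensive remaining task to the least-loaded GPU."""
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    loads = [0] * num_gpus
    for cost in sorted(costs, reverse=True):
        load, gpu = heapq.heappop(heap)
        loads[gpu] = load + cost
        heapq.heappush(heap, (loads[gpu], gpu))
    return loads


# Part 1: a 3x3 layer with 256 input and 256 output channels.
std = standard_conv_params(256, 256, 3)
sep = depthwise_separable_params(256, 256, 3)
print(f"standard: {std}, depthwise separable: {sep}, ratio: {std / sep:.2f}")

# Part 2: hypothetical per-block training costs in arbitrary time units.
block_costs = [9, 8, 7, 6, 5, 4, 3, 2]
print("round-robin finish time:", max(round_robin(block_costs, 4)))
print("longest-first finish time:", max(longest_first(block_costs, 4)))
```

Because the blocks are trained independently, the overall finish time is set by the most heavily loaded GPU, which is why cost-aware assignment can beat a cost-blind round-robin when block costs are uneven.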