Abstract: Deep neural networks (DNNs) and, in particular, convolutional neural networks (CNNs) have brought significant advances in a wide range of modern computer application problems. However, the increasing availability of large datasets and the growing computational power of modern computers have led to steady growth in the complexity and size of DNN and CNN models and thus to longer training times. Hence, various methods have been developed to accelerate and parallelize the training of complex network architectures. In this work, a novel CNN-DNN architecture is proposed that naturally supports a model parallel training strategy and that is loosely inspired by two-level domain decomposition methods (DDM). First, local CNN models, that is, subnetworks, are defined that operate on overlapping or nonoverlapping parts of the input data, for example, sub-images. The subnetworks can be trained completely in parallel and independently of each other. Each subnetwork then outputs a local decision for the given machine learning problem which is exclusively based on the respective local input data. Subsequently, in a second step, an additional DNN model is trained which evaluates the local decisions of the local subnetworks and generates a final, global decision. In this paper, we apply the proposed architecture to image classification problems using CNNs. Experimental results are provided for different 2D image classification problems, a face recognition problem, and a classification problem for 3D computed tomography (CT) scans. For this purpose, classical ResNet and VGG architectures are considered. The results show that the proposed approach can significantly reduce the required training time compared to the global model and, additionally, can help to improve the accuracy on the underlying classification problem.
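The decomposition step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `decompose`, the regular grid layout, and the definition of `overlap` as a number of pixels by which each sub-image extends into its neighbors are all assumptions made for illustration.

```python
from typing import List

Image = List[List[float]]


def decompose(image: Image, rows: int, cols: int, overlap: int = 0) -> List[Image]:
    """Split a 2D image into rows x cols sub-images.

    Each sub-image is extended by `overlap` pixels into its neighbors
    (clipped at the image boundary). overlap=0 gives a nonoverlapping
    decomposition. Hypothetical helper, not taken from the paper.
    """
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid size")
    sub_h, sub_w = height // rows, width // cols
    patches = []
    for i in range(rows):
        for j in range(cols):
            top = max(i * sub_h - overlap, 0)
            bottom = min((i + 1) * sub_h + overlap, height)
            left = max(j * sub_w - overlap, 0)
            right = min((j + 1) * sub_w + overlap, width)
            patches.append([row[left:right] for row in image[top:bottom]])
    return patches


# Toy 8 x 8 "image" whose pixel value encodes its position.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
patches = decompose(image, rows=2, cols=2, overlap=1)
print(len(patches), len(patches[0]), len(patches[0][0]))
```

Each entry of `patches` would be the input of one local CNN. Because no sub-image depends on the output of another, the local models can be trained on separate devices without communication.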

What problem does this paper attempt to address?

This paper addresses the long training times and high complexity of deep neural networks (DNNs) and convolutional neural networks (CNNs). As datasets and available computing power grow, network models become larger, and training takes longer. To counter this, the paper introduces a CNN-DNN architecture based on domain decomposition that supports model parallel training. The architecture first decomposes the global CNN into multiple local CNN subnetworks, which are trained independently and in parallel on overlapping or nonoverlapping parts of the input data. Each subnetwork produces a local decision based only on its local input data. An additional DNN model is then trained to evaluate the decisions of the local subnetworks and generate the final, global decision. This DNN can be seen as a coarse problem, so the overall method resembles a two-level domain decomposition approach. Experiments show that the method can significantly reduce training time and may improve classification accuracy. The applications considered include 2D image classification, face recognition, and classification of 3D computed tomography (CT) scans. Although the paper focuses on the classical ResNet and VGG architectures, it suggests considering more modern network structures such as MobileNetV2 in the future. In summary, the goal of the paper is a model parallel training strategy that speeds up CNN training on GPU clusters while maintaining or improving classification accuracy.
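The second step, in which a DNN combines the local decisions into a global one, can be sketched as follows. This is a minimal illustration under several assumptions: each local decision is taken to be a vector of class probabilities, the local CNNs are replaced by a toy stand-in (`toy_local_model`), and the DNN is reduced to a single dense layer with fixed averaging weights instead of trained ones. All names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_local_model(sub_image: Image) -> List[float]:
    """Stand-in for a trained local CNN: maps mean intensity to 2 class probabilities."""
    pixels = [p for row in sub_image for p in row]
    mean = sum(pixels) / len(pixels)
    return softmax([mean, 1.0 - mean])


def coarse_input(sub_images: Sequence[Image],
                 local_models: Sequence[Callable[[Image], List[float]]]) -> List[float]:
    """Concatenate the local decisions into one feature vector for the global model."""
    features: List[float] = []
    for sub_image, model in zip(sub_images, local_models):
        features.extend(model(sub_image))
    return features


def dense_softmax(features: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  biases: Sequence[float]) -> List[float]:
    """One dense layer followed by softmax (a simplification of the global DNN)."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


num_sub, num_classes = 4, 2
bright = [[1.0] * 4 for _ in range(4)]
dark = [[0.0] * 4 for _ in range(4)]
sub_images = [bright, bright, bright, dark]

features = coarse_input(sub_images, [toy_local_model] * num_sub)

# Assumed weights: average the local probabilities per class. In the method
# described above, these parameters would be learned in the second training step.
weights = [[1.0 / num_sub if idx % num_classes == k else 0.0
            for idx in range(num_sub * num_classes)]
           for k in range(num_classes)]
global_decision = dense_softmax(features, weights, [0.0] * num_classes)
predicted_class = max(range(num_classes), key=lambda k: global_decision[k])
print(predicted_class)
```

The input dimension of the global model is the number of subnetworks times the number of classes, which is small compared to the image itself. That is the sense in which the DNN acts as a coarse problem on top of the local models.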

A Domain Decomposition-Based CNN-DNN Architecture for Model Parallel Training Applied to Image Recognition Problems
