Contents

Abstract
1 Introduction
  1.1 Application Background
  1.2 Performance Demands for Deep Learning
  1.3 Existing Parallel Frameworks of Deep Learning
  1.4 Chapter Organization
2 Concepts and Categories of Deep Learning
  2.1 Deep Learning
    2.1.1 Artificial Neural Networks
    2.1.2 Concept of Deep Learning
  2.2 Mainstream Deep Learning Models
    2.2.1 Autoencoders
    2.2.2 Back Propagation
    2.2.3 Convolutional Neural Network
3 Parallel Optimization for Deep Learning
  3.1 Convolutional Architecture for Fast Feature Embedding
    3.1.1 Introduction
    3.1.2 CUDA Programming
    3.1.3 Architecture of Caffe
    3.1.4 Parallel Implementation of Convolution in Caffe
  3.2 DistBelief
    3.2.1 Introduction of DistBelief
    3.2.2 Downpour SGD
    3.2.4 Sandblaster L-BFGS
  3.3 Deep Learning Based-on Multi-GPUs
    3.3.1 Data Parallelism
    3.3.2 Model Parallelism
    3.3.3 Data-Model Parallelism
    3.3.4 Example System of Multi-GPUs
4 Discussions
  4.1 Grand Challenges of Deep Learning with Big Data
    4.1.1 Massive Amounts of Training Sample
    4.1.2 Incremental Streaming Data
    4.1.3 Learning Speed with Big Data
    4.1.4 Scalability of Deep Models
  4.2 Future Work
References

Deep Learning and Its Parallelization: Concepts and Instances

Xiaqing Li, Guangyan Zhang, Keqin Li, and Weimin Zheng
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
