Abstract
1 Introduction
  1.1 Application Background
  1.2 Performance Demands for Deep Learning
  1.3 Existing Parallel Frameworks of Deep Learning
  1.4 Chapter Organization
2 Concepts and Categories of Deep Learning
  2.1 Deep Learning
    2.1.1 Artificial Neural Networks
    2.1.2 Concept of Deep Learning
  2.2 Mainstream Deep Learning Models
    2.2.1 Autoencoders
    2.2.2 Back Propagation
    2.2.3 Convolutional Neural Network
3 Parallel Optimization for Deep Learning
  3.1 Convolutional Architecture for Fast Feature Embedding
    3.1.1 Introduction
    3.1.2 CUDA Programming
    3.1.3 Architecture of Caffe
    3.1.4 Parallel Implementation of Convolution in Caffe
  3.2 DistBelief
    3.2.1 Introduction of DistBelief
    3.2.2 Downpour SGD
    3.2.4 Sandblaster L-BFGS
  3.3 Deep Learning Based-on Multi-GPUs
    3.3.1 Data Parallelism
    3.3.2 Model Parallelism
    3.3.3 Data-Model Parallelism
    3.3.4 Example System of Multi-GPUs
4 Discussions
  4.1 Grand Challenges of Deep Learning with Big Data
    4.1.1 Massive Amounts of Training Sample
    4.1.2 Incremental Streaming Data
    4.1.3 Learning Speed with Big Data
    4.1.4 Scalability of Deep Models
  4.2 Future Work
References

Deep Learning and Its Parallelization: Concepts and Instances
Xiaqing Li, Guangyan Zhang, Keqin Li, and Weimin Zheng
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Coded Parallelism for Distributed Deep Learning

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

A Linear Algebraic Approach to Model Parallelism in Deep Learning

Non-Linear Coded Computation for Distributed CNN Inference: A Learning-based Approach

Joint Coding and Scheduling Optimization for Distributed Learning Over Wireless Edge Networks

Deep Learning and Its Parallelization

Integrated Model, Batch and Domain Parallelism in Training Neural Networks

Distributed High-Performance Computing Methods for Accelerating Deep Learning Training

AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

Distributed Newton Methods for Deep Neural Networks

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent

Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

Optimal distributed parallel algorithms for deep learning framework Tensorflow

Slim-DP: A Multi-Agent System for Communication-Efficient Distributed Deep Learning

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis