Abstract: As a new area of machine learning research, deep learning has attracted a lot of attention from the research community. It may bring people to a higher cognitive level of understanding data. Its unsupervised pre-training step finds high-dimensional representations, or abstract features, that work much better than those produced by principal component analysis (PCA). However, deep learning runs into problems on large-scale data, because its many levels of training make the computation intensive. Sequential deep learning algorithms usually cannot finish the computation in an acceptable time. In this paper, we propose a parallel many-core algorithm for Intel Xeon Phi many-core systems that speeds up the unsupervised training of the Sparse Autoencoder and the Restricted Boltzmann Machine (RBM). Using the sequential training algorithm as a baseline, we applied several optimization methods to parallelize it. The experimental results show that our fully optimized parallel Sparse Autoencoder achieves more than a 300-fold speedup over the original sequential algorithm on the Intel Xeon Phi coprocessor. We also ran the fully optimized code on both the Intel Xeon Phi coprocessor and an expensive Intel Xeon CPU: for this application, our method is 7 to 10 times faster on the Intel Xeon Phi coprocessor than on the Intel Xeon CPU. In addition, we compared our fully optimized code on the Intel Xeon Phi with a Matlab implementation running on a single Intel Xeon CPU; our method runs 16 times faster. The results also suggest that, compared with a GPU, the Intel Xeon Phi offers an efficient but more general-purpose way to parallelize deep learning algorithms. It is also faster, with better parallelism, than the Intel Xeon CPU.
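To make the kind of parallelism described in the abstract concrete, the following is a minimal sketch of one building block of autoencoder training: computing the hidden-layer pre-activations for a mini-batch across many threads. It is not the authors' code. The use of OpenMP, the function name `layer_preactivations`, and the row-major data layout are all assumptions made for illustration; the abstract does not say which optimizations were actually applied.

```c
#include <stddef.h>

/* Hypothetical helper (not from the paper): hidden-layer pre-activations
 * for a mini-batch, Z = X * W^T + b, with all matrices stored row-major.
 *   X: n_samples x n_in      W: n_hidden x n_in
 *   b: n_hidden              Z: n_samples x n_hidden
 */
void layer_preactivations(const float *X, const float *W, const float *b,
                          float *Z, int n_samples, int n_in, int n_hidden)
{
    /* Every (sample, hidden unit) pair is independent, so the two outer
     * loops are collapsed and divided among the available threads.
     * If the compiler is run without OpenMP support, the pragmas are
     * ignored and the loops simply run sequentially. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int s = 0; s < n_samples; ++s) {
        for (int h = 0; h < n_hidden; ++h) {
            const float *x = X + (size_t)s * n_in;   /* input row s   */
            const float *w = W + (size_t)h * n_in;   /* weight row h  */
            float acc = b[h];
            /* Ask the compiler to vectorize the dot product. */
            #pragma omp simd reduction(+:acc)
            for (int i = 0; i < n_in; ++i)
                acc += x[i] * w[i];
            Z[(size_t)s * n_hidden + h] = acc;
        }
    }
}
```

The same function doubles as the sequential baseline when compiled without OpenMP, which is one simple way to check that a parallel version produces the same numbers before measuring any speedup.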

Parallel Computing in DNNs Using CPU and MIC

DaDianNao: A Machine-Learning Supercomputer

Training Large Scale Deep Neural Networks on the Intel Xeon Phi Many-Core Coprocessor

Parallelizing Convolutional Neural Networks On Intel (R) Many Integrated Core Architecture

Asynchronous Parallel Dijkstra's Algorithm on Intel Xeon Phi Processor - How to Accelerate Irregular Memory Access Algorithm

Accelerating FDTD Simulation of Microwave Pulse Coupling into Narrow Slots on the Intel MIC Architecture

Accelerating Embarrassingly Parallel Algorithm on Intel MIC

MIC acceleration of short-range molecular dynamics simulations

CAP: Communication-aware Automated Parallelization for Deep Learning Inference on CMP Architectures

Parallelization and Optimization of Molecular Dynamics Simulation on Many Integrated Core

Trends of Intel MIC Application in Bioinformatics

A Parallel Non-Local Means Denoising Algorithm Implementation with OpenMP and OpenCL on Intel Xeon Phi Coprocessor

Coded Parallelism for Distributed Deep Learning

ParaX: boosting deep learning for big data analytics on many-core CPUs

Parallelization And Performance Optimization Of Calculation In Three-Dimensional Underwater Acoustic Propagation On Modern Many-Core Processor

Solving the Cardiac Model Using Multi-core CPU and Many Integrated Cores (MIC)

Accelerated 3D Full Band Self-consistent Ensemble Monte Carlo Device Simulation Utilizing Intel MIC Coprocessors on TianHe II

MIC-THPCM: MIC-Based Heterogeneous Parallel Optimization for Axial Compressor Rotor

Performance Modelling of Deep Learning on Intel Many Integrated Core Architectures

A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer

An architecture-level analysis on deep learning models for low-impact computations