Abstract: As a new area of machine learning research, deep learning has attracted a lot of attention from the research community. It may help people understand data at a higher level of abstraction. Its unsupervised pre-training step can find high-dimensional representations, or abstract features, that work much better than those produced by principal component analysis (PCA). However, deep learning faces problems on large-scale data, because its many levels of training make the computation intensive. Sequential deep learning algorithms usually cannot finish the computation in an acceptable time. In this paper, we propose a parallel many-core algorithm for Intel Xeon Phi many-core systems that speeds up the unsupervised training of the Sparse Autoencoder and the Restricted Boltzmann Machine (RBM). Using the sequential training algorithm as a baseline, we adopted several optimization methods to parallelize it. The experimental results show that our fully optimized Sparse Autoencoder achieves more than a 300-fold speedup over the original sequential algorithm on the Intel Xeon Phi coprocessor. We also ran the fully optimized code on both the Intel Xeon Phi coprocessor and an expensive Intel Xeon CPU: for this application, it runs 7 to 10 times faster on the Intel Xeon Phi coprocessor than on the Intel Xeon CPU. In addition, we compared our fully optimized code on the Intel Xeon Phi with Matlab code running on a single Intel Xeon CPU, and our method runs 16 times faster than the Matlab implementation. The results also suggest that the Intel Xeon Phi offers an efficient and more general-purpose way to parallelize deep learning algorithms than a GPU, and that it achieves higher speed and better parallelism than the Intel Xeon CPU.
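To make the idea of parallelizing Sparse Autoencoder training more concrete, the following is a minimal sketch of how the cost over a batch can be split into independent chunks whose partial sums are then combined. It is not the paper's implementation: the function names, the one-hidden-layer network, the sparsity target `rho`, the penalty weight `beta`, and the use of Python threads are all assumptions made for illustration, and weight decay is omitted. Python threads do not give a real speedup here; the sketch only shows the data-parallel decomposition that a many-core version would exploit.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer autoencoder: returns (hidden activations, reconstruction)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, y

def chunk_stats(chunk, params):
    """Partial sums for one chunk: squared reconstruction error and hidden activations."""
    W1, b1, W2, b2 = params
    err = 0.0
    act = [0.0] * len(W1)
    for x in chunk:
        h, y = forward(x, W1, b1, W2, b2)
        err += 0.5 * sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        act = [a + hi for a, hi in zip(act, h)]
    return err, act

def sparse_ae_cost(data, params, rho=0.05, beta=3.0, workers=1):
    """Mean reconstruction error plus a KL sparsity penalty (hypothetical defaults)."""
    # Split the batch into roughly equal chunks, one per worker.
    size = max(1, math.ceil(len(data) / workers))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Each chunk is processed independently; only the partial sums are shared.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: chunk_stats(c, params), chunks))
    m = len(data)
    err = sum(p[0] for p in parts) / m
    # Average activation of each hidden unit over the whole batch.
    rho_hat = [sum(p[1][j] for p in parts) / m for j in range(len(params[0]))]
    kl = sum(rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
             for r in rho_hat)
    return err + beta * kl
```

Because each chunk only contributes partial sums, the result is the same (up to floating-point rounding) for any number of workers, which is what makes this kind of computation a natural fit for many-core hardware.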
