Abstract: The stacked autoencoder is a deep learning model composed of multiple autoencoders, and it has been widely applied in machine learning. Considerable effort has gone into scaling up deep learning models, in both training-set size and number of model parameters, to improve performance. However, training a large deep learning model is highly time consuming. Recent studies have used CPU clusters with thousands of machines, as well as single GPUs and GPU clusters, to train large-scale deep learning models. Like the GPU, the Xeon Phi is a high-performance coprocessor, and it can serve as an alternative for training large-scale deep learning models on a single machine. The Xeon Phi can be viewed as a small cluster with about 60 cores, each supporting four hardware threads. This massive parallelism offsets the low computing capacity of each core, but it makes designing an efficient parallel autoencoder implementation challenging.

In this paper, we analyze the matrix-operation-based training algorithm for autoencoders and identify a thread oversubscription problem that degrades performance. Based on this observation, we propose a map-reduce implementation of autoencoders on the Xeon Phi coprocessor. The basic idea is to run multiple autoencoder model replicas in parallel under the bulk synchronous parallel (BSP) communication model, in which the parameters are updated only after all replicas have finished their computations. Each thread is responsible for one model replica, and all replicas work together on the same mini-batch. This data-parallel method is well suited to training autoencoders on the Xeon Phi and can be extended to asynchronous parallel training without thread oversubscription. In our experiments, the method achieves a fourfold speedup over the sequential implementation, and the speedup remains stable as the autoencoder model grows.
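The BSP map-reduce scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is pure Python, and the network shape (one sigmoid hidden layer, linear decoder, untied weights, squared-error loss) and all names (`init_params`, `shard_gradient`, `reduce_gradients`, `bsp_step`) are assumptions made for illustration. Each replica thread computes gradients on its slice of the shared mini-batch (map), the pool waits for every replica (the BSP barrier), and the summed gradient is applied once (reduce and update).

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def init_params(n_in, n_hidden, seed=0):
    """Hypothetical untied autoencoder: x -> sigmoid(W1 x + b1) -> W2 h + b2."""
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
    return {"W1": rand(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": rand(n_in, n_hidden), "b2": [0.0] * n_in}


def forward(p, x):
    h = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
         for row, b in zip(p["W1"], p["b1"])]
    y = [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(p["W2"], p["b2"])]
    return h, y


def loss(p, batch):
    """Mean squared reconstruction error over the batch."""
    return sum(0.5 * sum((yi - xi) ** 2 for yi, xi in zip(forward(p, x)[1], x))
               for x in batch) / len(batch)


def shard_gradient(p, shard):
    """Map step: one replica sums the loss gradients over its shard."""
    n_hidden, n_in = len(p["b1"]), len(p["b2"])
    g = {"W1": [[0.0] * n_in for _ in range(n_hidden)], "b1": [0.0] * n_hidden,
         "W2": [[0.0] * n_hidden for _ in range(n_in)], "b2": [0.0] * n_in}
    for x in shard:
        h, y = forward(p, x)
        dy = [yi - xi for yi, xi in zip(y, x)]  # dL/dy
        dz = [sum(p["W2"][i][j] * dy[i] for i in range(n_in)) * h[j] * (1.0 - h[j])
              for j in range(n_hidden)]  # backprop through the sigmoid
        for i in range(n_in):
            g["b2"][i] += dy[i]
            for j in range(n_hidden):
                g["W2"][i][j] += dy[i] * h[j]
        for j in range(n_hidden):
            g["b1"][j] += dz[j]
            for i in range(n_in):
                g["W1"][j][i] += dz[j] * x[i]
    return g


def reduce_gradients(grads):
    """Reduce step: element-wise sum of the per-replica gradients."""
    total = grads[0]
    for g in grads[1:]:
        for k in ("b1", "b2"):
            total[k] = [a + b for a, b in zip(total[k], g[k])]
        for k in ("W1", "W2"):
            total[k] = [[a + b for a, b in zip(ra, rb)]
                        for ra, rb in zip(total[k], g[k])]
    return total


def bsp_step(p, batch, n_replicas, lr=0.1):
    """One BSP superstep: map over shards, wait for all replicas, update once."""
    shards = [batch[r::n_replicas] for r in range(n_replicas)]
    with ThreadPoolExecutor(max_workers=n_replicas) as pool:
        # list(...) blocks until every replica has finished: the barrier.
        grads = list(pool.map(lambda s: shard_gradient(p, s), shards))
    g = reduce_gradients(grads)
    scale = lr / len(batch)
    for k in ("b1", "b2"):
        p[k] = [w - scale * d for w, d in zip(p[k], g[k])]
    for k in ("W1", "W2"):
        p[k] = [[w - scale * d for w, d in zip(rw, rd)]
                for rw, rd in zip(p[k], g[k])]
    return g
```

Because the parameters change only after the barrier, the summed gradient matches what a single sequential pass over the whole mini-batch would compute, up to floating-point summation order. Python threads share the interpreter lock, so this sketch shows only the synchronization pattern, not the speedup; a native implementation would presumably map each replica to one hardware thread.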

A Map-Reduce Method for Training Autoencoders on Xeon Phi
