A Map-Reduce Method for Training Autoencoders on Xeon Phi.

Qiongjie Yao, Xiaofei Liao, Hai Jin
DOI: https://doi.org/10.1109/cit/iucc/dasc/picom.2015.197
2015-01-01
Abstract: The stacked autoencoder is a deep learning model that consists of multiple autoencoders and has been widely applied in machine learning. Considerable effort has gone into scaling up deep learning models, in both training dataset size and number of model parameters, to improve performance. However, training a large deep learning model is highly time consuming. Recent studies have used CPU clusters with thousands of machines, as well as single GPUs and GPU clusters, to train large-scale deep learning models. As a high-performance coprocessor comparable to a GPU, the Xeon Phi can be an alternative for training large-scale deep learning models on a single machine. The Xeon Phi can be viewed as a small cluster of about 60 cores, each supporting four hardware threads. This massive parallelism offsets the low computing capacity of each core, but it makes designing an efficient parallel autoencoder implementation challenging. In this paper, we analyze the autoencoder training algorithm based on matrix operations and point out the thread oversubscription problem, which degrades performance. Based on this observation, we propose a map-reduce implementation of autoencoders on the Xeon Phi coprocessor. Our basic idea is to parallelize multiple autoencoder model replicas under the bulk synchronous parallel (BSP) communication model, in which the parameters are updated only after the computations of all replicas have completed. Each thread is responsible for one model replica, and all replicas work together on the same mini-batch. This data-parallel method is well suited to training autoencoders on the Xeon Phi and can be extended to asynchronous parallel training without thread oversubscription. In our experiments, our method achieves a fourfold speedup over the sequential implementation, and the speedup remains stable as the autoencoder model grows.
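The BSP scheme described in the abstract (one replica per thread, all replicas sharing one mini-batch, parameters updated only after every replica finishes) can be illustrated with a minimal sketch. This is not the authors' Xeon Phi implementation: it uses Python threads and NumPy on a host CPU, and it assumes a single-hidden-layer autoencoder with a sigmoid encoder, a linear decoder, and squared-error loss. All names (`init_params`, `shard_gradient`, `bsp_step`) are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_visible, n_hidden, seed=0):
    """Small random weights and zero biases (illustrative initialization)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_visible, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_visible)),
        "b2": np.zeros(n_visible),
    }


def shard_gradient(params, x):
    """Map step: gradient of 0.5 * ||y - x||^2 summed over the rows of x."""
    h = sigmoid(x @ params["W1"] + params["b1"])   # encoder
    y = h @ params["W2"] + params["b2"]            # linear decoder (assumption)
    d_y = y - x
    d_h = (d_y @ params["W2"].T) * h * (1.0 - h)
    return {
        "W1": x.T @ d_h,
        "b1": d_h.sum(axis=0),
        "W2": h.T @ d_y,
        "b2": d_y.sum(axis=0),
    }


def bsp_step(params, batch, n_replicas=4, lr=0.1):
    """One BSP superstep: map over shards, wait for all, reduce, update."""
    shards = np.array_split(batch, n_replicas)     # one shard per replica
    with ThreadPoolExecutor(max_workers=n_replicas) as pool:
        # list() blocks until every replica has returned, acting as the barrier.
        grads = list(pool.map(lambda s: shard_gradient(params, s), shards))
    # Reduce step: sum the per-replica gradients and average over the mini-batch.
    return {
        k: params[k] - lr * sum(g[k] for g in grads) / len(batch)
        for k in params
    }
```

Because every replica reads the same parameters and the update happens only after the barrier, one `bsp_step` should produce the same result as a sequential gradient step over the whole mini-batch, up to floating-point rounding. Keeping `n_replicas` at or below the number of hardware threads mirrors the one-thread-per-replica idea in the abstract.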