Abstract: In recent years, machine learning and deep learning techniques such as deep neural networks and recurrent neural networks have found uses in diverse fields including computer vision, speech recognition, natural language processing, social network analysis, bioinformatics, and medicine, where they have produced results comparable to, and in some cases surpassing, those of human experts. Machine learning requires large amounts of training data, much of which resides in object storage, an inexpensive and scalable data store. Deep learning also makes use of state-of-the-art processing capabilities from high-end GPUs and accelerators, such as Google Tensor Processing Units (TPUs), which enable parallel and efficient execution and can sustain very high throughput. This constitutes an impedance mismatch: object storage is not designed for high-performance data transfers, and standard practices for feeding deep learning models from object storage can result in poor training performance. Furthermore, the typical deep learning framework uses a file access interface, whereas object storage supports a REST-based interface with different APIs and semantics than a file system [2]. To take full advantage of these GPUs and operate at full utilization, frameworks such as TensorFlow, Caffe, and Torch need to deliver data fast enough to keep the GPUs busy. This becomes a significant challenge when the training data does not reside on the same machine as the GPUs, as is the case with object storage, and it leaves the expensive processing units underutilized. To resolve the impedance mismatch and keep the processing units fully utilized, we have added a FUSE-based file system, S3fs [1], to our deep learning stack. S3fs translates POSIX file API requests into REST API calls against the object storage.
S3fs is an open source project that, as part of this work, we optimized so that each read request is translated into multiple concurrent range read requests against the object storage. This yields higher throughput from the object storage than is possible with the naive approach. Reads are cached in memory and served back to the deep learning framework asynchronously. Because deep learning frameworks often run their training over multiple epochs, the speed of the in-memory cache is highly beneficial. Our FUSE-based architecture has been implemented in the Deep Learning as a Service offering on the IBM Cloud, and our S3fs enhancements have been contributed to the S3fs project repository. Using our architecture, we are able to speed up deep learning performance manyfold and keep expensive GPUs fully utilized.

tf.data service: A Case for Disaggregating ML Input Data Processing

tf.data: A Machine Learning Data Processing Framework

Efficient Tabular Data Preprocessing of ML Pipelines

GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning

FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline

DisaggRec: Architecting Disaggregated Systems for Large-Scale Personalized Recommendation

Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores

Dual-pronged deep learning preprocessing on heterogeneous platforms with CPU, GPU and CSD

Keeping deep learning GPUs well fed using object storage

Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines

DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

Overlapped Data Processing Scheme for Accelerating Training and Validation in Machine Learning

TensorFlow: A system for large-scale machine learning

tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads

TACC: A Full-stack Cloud Computing Infrastructure for Machine Learning Tasks

MLaaS4HEP: Machine Learning as a Service for HEP

Accelerated Cloud for Artificial Intelligence (ACAI)

TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

DistMind: Efficient Resource Disaggregation for Deep Learning Workloads

Near-Data Processing for Differentiable Machine Learning Models