Abstract: In recent years, machine learning and deep learning techniques such as deep neural networks and recurrent neural networks have found uses in diverse fields including computer vision, speech recognition, natural language processing, social network analysis, bioinformatics, and medicine, where they have produced results comparable to, and in some cases surpassing, those of human experts. Machine learning requires large amounts of data for training its models, and much of this data resides in object storage, an inexpensive and scalable data store. Deep learning also makes use of state-of-the-art processing capabilities from high-end GPUs and accelerators, such as Google Tensor Processing Units (TPUs), which enable parallel and efficient execution. The throughput that such GPUs can support is very high. This constitutes an impedance mismatch, however, as object storage is not designed for high-performance data transfers, and standard practices for feeding deep learning models from object storage can result in poor training performance. Furthermore, the typical deep learning framework uses a file access interface, whereas object storage supports a REST-based interface with different APIs and semantics than a file system [2]. To take full advantage of these GPUs and operate at full utilization, frameworks such as TensorFlow, Caffe, and Torch need to deliver data as fast as possible to keep the GPUs busy. This becomes a significant challenge when the training data does not reside on the same machine as the GPUs, as is the case when using object storage, resulting in a utilization challenge for the expensive processing units. To solve the impedance mismatch and keep the processing units fully utilized, we have added a FUSE-based file system, S3fs [1], to our deep learning stack. S3fs translates POSIX file API requests into REST API calls against the object storage.
It is an open source project which, as part of this work, we optimized so that each read request is translated into multiple concurrent range read requests against the object storage. This enables us to obtain higher throughput from the object storage than is possible using the naive approach. Reads are cached in memory and are served back to the deep learning framework asynchronously. Since deep learning frameworks often run their training in multiple epochs, the speed of the in-memory cache is highly beneficial. Our FUSE-based architecture has been implemented in the Deep Learning as a Service offering on the IBM Cloud, and our S3fs enhancements have been contributed to the S3fs project repository. Using our architecture, we are able to speed up deep learning performance manyfold and keep expensive GPUs fully utilized.
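
The read path described in the abstract (one file read split into several concurrent range reads, with results kept in memory for later epochs) can be illustrated with a minimal Python sketch. This is not the S3fs implementation, which is written in C++ and is not shown in the abstract. The names `OBJECT`, `fetch_range`, and `RangeReader`, the 64 KiB chunk size, and the worker count are all hypothetical, and the object store is simulated by an in-memory byte string so that the example runs without a network. In a real client, `fetch_range` would issue an HTTP GET carrying a `Range: bytes=start-end` header, where `end` is inclusive.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one stored object (1 MiB of sample bytes).
OBJECT = bytes(range(256)) * 4096


def fetch_range(start: int, end: int) -> bytes:
    """Simulate a GET with 'Range: bytes=start-end' (end is inclusive)."""
    return OBJECT[start:end + 1]


class RangeReader:
    """Serve reads from fixed-size chunks fetched concurrently and cached."""

    def __init__(self, fetch, size, chunk=64 * 1024, workers=8):
        self.fetch, self.size = fetch, size
        self.chunk, self.workers = chunk, workers
        self.cache = {}      # chunk index -> bytes (unbounded, for brevity)
        self.requests = 0    # number of range requests actually issued

    def read(self, offset: int, length: int) -> bytes:
        end = min(offset + length, self.size)  # exclusive end of the read
        if offset >= end:
            return b""
        first, last = offset // self.chunk, (end - 1) // self.chunk
        # Only chunks that are not cached yet need a range request.
        missing = [i for i in range(first, last + 1) if i not in self.cache]
        ranges = [(i * self.chunk, min((i + 1) * self.chunk, self.size) - 1)
                  for i in missing]
        # Issue the range requests concurrently; map() preserves order.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            parts = list(pool.map(lambda r: self.fetch(*r), ranges))
        self.cache.update(zip(missing, parts))
        self.requests += len(missing)
        # Assemble the covering chunks and trim to the requested window.
        data = b"".join(self.cache[i] for i in range(first, last + 1))
        skip = offset - first * self.chunk
        return data[skip:skip + (end - offset)]


reader = RangeReader(fetch_range, len(OBJECT))
first_epoch = reader.read(100_000, 300_000)   # fetched as concurrent ranges
second_epoch = reader.read(100_000, 300_000)  # served from the cache
```

In this sketch the second read issues no further range requests, which mirrors the benefit the abstract attributes to the in-memory cache across training epochs. A production cache would also need a size bound and an eviction policy, which are omitted here.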

FanStore: Enabling Efficient and Scalable I/O for Distributed Deep Learning

High Performance I/O For Large Scale Deep Learning

A Novel Scalable Architecture of Cloud Storage System for Small Files Based on P2P

Keeping deep learning GPUs well fed using object storage

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

Shastor: A Scalable Hdfs-Based Storage Framework For Small-Write Efficiency In Pervasive Computing

The Quest For Scalable Support Of Data-Intensive Workloads In Distributed Systems

λ-IO: A Unified IO Stack for Computational Storage

High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms

SepStore: Data Storage Accelerator for Distributed File Systems by Separating Small Files from Large Files

DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training

LightFS: A Lightweight Host-CSD Coordinated File System Optimizing for Heavy Small File Accesses

STANNIS: Low-Power Acceleration of Deep Neural Network Training Using Computational Storage

Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs

Cognitive Ssd: A Deep Learning Engine For In-Storage Data Retrieval

Bigflow: A General Optimization Layer for Distributed Computing Frameworks

Automating distributed tiered storage management in cluster computing

DFS-Perf: A Scalable and Unified Benchmarking Framework for Distributed File Systems

Conflux: Exploiting Persistent Memory and RDMA Bandwidth Via Adaptive I/O Mode Selection

Collage: Seamless Integration of Deep Learning Backends with Automatic Placement

OCStore: Accelerating Distributed Object Storage with Open-Channel SSDs