Abstract: In recent years, machine learning and deep learning techniques such as deep neural networks and recurrent neural networks have found uses in diverse fields including computer vision, speech recognition, natural language processing, social network analysis, bioinformatics, and medicine, where they have produced results comparable to, and in some cases surpassing, those of human experts. Machine learning requires large amounts of training data, much of which resides in object storage, an inexpensive and scalable data store. Deep learning also makes use of state-of-the-art processing capabilities from high-end GPUs and accelerators, such as Google Tensor Processing Units (TPUs), which enable parallel and efficient execution and can sustain very high throughput. This creates an impedance mismatch: object storage is not designed for high-performance data transfers, and standard practices for feeding deep learning models from object storage can result in poor training performance. Furthermore, the typical deep learning framework uses a file access interface, whereas object storage supports a REST-based interface with different APIs and semantics than a file system [2]. To take full advantage of these GPUs, frameworks such as TensorFlow, Caffe, and Torch need to deliver data fast enough to keep the GPUs busy. This becomes a significant challenge when the training data does not reside on the same machine as the GPUs, as is the case with object storage, leaving the expensive processing units underutilized. To resolve the impedance mismatch and keep the processing units fully utilized, we have added a FUSE-based file system, S3fs [1], to our deep learning stack. S3fs translates POSIX file API requests into REST API requests against the object storage. It is an open source project which, as part of this work, we optimized so that each read request is translated into multiple concurrent range read requests against the object storage. This yields higher throughput from the object storage than is possible with the naive approach. Reads are cached in memory and served back to the deep learning framework asynchronously. Since deep learning frameworks often run their training over multiple epochs, the speed of the in-memory cache is highly beneficial. Our FUSE-based architecture has been implemented in the Deep Learning as a Service offering on the IBM Cloud, and our S3fs enhancements have been contributed to the S3fs project repository. Using our architecture, we are able to speed up deep learning performance manyfold and keep expensive GPUs fully utilized.
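
The read path described in the abstract (one file read fanned out into several concurrent range reads, with results held in an in-memory cache) can be illustrated with a minimal Python sketch. This is not the S3fs implementation, which is not shown here. The names `RangeReader`, `fetch_range`, and `make_in_memory_fetcher`, along with the chunk size and worker count, are hypothetical. In a real deployment `fetch_range` would issue an HTTP GET with a `Range: bytes=start-end` header against the object store; here an in-memory byte string stands in for the object.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical callable: fetch_range(start, end_inclusive) -> bytes.
# It stands in for an HTTP GET carrying a "Range: bytes=start-end" header.
FetchRange = Callable[[int, int], bytes]


def make_in_memory_fetcher(blob: bytes) -> FetchRange:
    """Return a fetcher over an in-memory blob (illustration only)."""
    def fetch_range(start: int, end_inclusive: int) -> bytes:
        # Like an HTTP range request, a range past the end is clamped.
        return blob[start:end_inclusive + 1]
    return fetch_range


class RangeReader:
    """Serve reads by fetching fixed-size chunks concurrently and caching them."""

    def __init__(self, fetch_range: FetchRange, chunk_size: int = 1 << 20,
                 workers: int = 8) -> None:
        self.fetch_range = fetch_range
        self.chunk_size = chunk_size
        self.workers = workers
        self.cache: Dict[int, bytes] = {}  # chunk index -> chunk bytes

    def _fetch_chunk(self, index: int) -> bytes:
        start = index * self.chunk_size
        return self.fetch_range(start, start + self.chunk_size - 1)

    def read(self, offset: int, length: int) -> bytes:
        if length <= 0:
            return b""
        first = offset // self.chunk_size
        last = (offset + length - 1) // self.chunk_size
        missing = [i for i in range(first, last + 1) if i not in self.cache]
        if missing:
            # Issue all missing range reads concurrently.
            with ThreadPoolExecutor(max_workers=self.workers) as pool:
                for i, data in zip(missing, pool.map(self._fetch_chunk, missing)):
                    self.cache[i] = data
        data = b"".join(self.cache[i] for i in range(first, last + 1))
        skip = offset - first * self.chunk_size
        return data[skip:skip + length]


if __name__ == "__main__":
    reader = RangeReader(make_in_memory_fetcher(bytes(range(100))), chunk_size=16)
    first_epoch = reader.read(10, 50)   # fetched as concurrent range reads
    second_epoch = reader.read(10, 50)  # served entirely from the cache
    print(first_epoch == second_epoch)
```

Aligning chunks to fixed boundaries is a design choice of this sketch: it lets overlapping reads in later epochs reuse cached chunks regardless of the offset the framework asks for. The sketch omits cache eviction and the asynchronous hand-off to the framework that the abstract mentions.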

FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads

Accelerating End-to-End Deep Learning Workflow With Codesign of Data Preprocessing and Scheduling

TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading

Accelerating Deep Learning Inference via Model Parallelism and Partial Computation Offloading

ElasticFlow: an Elastic Serverless Training Platform for Distributed Deep Learning

Energy-Efficient GPU Clusters Scheduling for Deep Learning

NanoFlow: Towards Optimal Large Language Model Serving Throughput

SoCFlow: Efficient and Scalable DNN Training on SoC-Clustered Edge Servers

ServeFlow: A Fast-Slow Model Architecture for Network Traffic Analysis

CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system

DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems

FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation

Keeping deep learning GPUs well fed using object storage

AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

Pipeline-based Optimization Method for Large-Scale End-to-End Inference

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Supporting Very Large Models using Automatic Dataflow Graph Partitioning

Work-in-Progress: Furion: Alleviating Overheads for Deep Learning Framework on Single Machine