Abstract: Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. The host CPU and RAM needed per accelerator core to avoid data stalls vary across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to under-utilizing either the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can scale out horizontally to right-size CPU/RAM host resources for data processing in each job, reducing training time by 32x and cost by 26x, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
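To make the straggler argument behind coordinated reads concrete, the following is a minimal sketch in plain Python, not the tf.data service implementation. It assumes synchronous training in which each step takes as long as its largest batch, and it approximates coordination by grouping similarly sized batches into the same step. The function names and the cost model are hypothetical and chosen only for illustration.

```python
def total_step_time(batch_sizes, num_workers):
    """Sum of per-step costs when every step waits for its largest batch.

    Assumption: step cost is proportional to the largest batch size
    handed to any worker in that step (synchronous training).
    """
    steps = [batch_sizes[i:i + num_workers]
             for i in range(0, len(batch_sizes), num_workers)]
    return sum(max(step) for step in steps)


def uncoordinated_time(batch_sizes, num_workers):
    # Workers take batches in arrival order, so one step may mix
    # small and large batches and the small ones wait on the large.
    return total_step_time(batch_sizes, num_workers)


def coordinated_time(batch_sizes, num_workers):
    # Simplified stand-in for coordinated reads: batches of similar
    # size are grouped so that all workers in a step get comparable work.
    return total_step_time(sorted(batch_sizes), num_workers)


# Hypothetical workload: alternating short and long batches, 2 workers.
sizes = [8, 64, 8, 64, 8, 64, 8, 64]
print(uncoordinated_time(sizes, 2))  # 256
print(coordinated_time(sizes, 2))    # 144
```

In this toy workload every uncoordinated step pairs a short batch with a long one, so each step costs 64 units. Grouping similar sizes lets half the steps finish in 8 units. The real service coordinates reads across workers at runtime rather than sorting the dataset, so this shows only the effect, not the mechanism.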
