Abstract: Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. The host CPU and RAM needed per accelerator core to process input data without causing data stalls varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to under-utilizing either the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale out to right-size CPU/RAM host resources for data processing in each job, reducing training time by 32x and cost by 26x, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
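
The straggler effect that coordinated reads target can be illustrated with a toy model. The sketch below is not the tf.data service implementation and does not use its API. It assumes a synchronous training step lasts as long as its slowest worker, and that a worker's cost is the padded size of its batch (batch size times longest sequence). Globally sorting inputs by length stands in for whatever size-aware grouping the service performs. All function names are hypothetical.

```python
def padded_cost(batch):
    # Assumed cost model: every sequence is padded to the longest one in the batch.
    return len(batch) * max(batch)


def total_training_time(lengths, num_workers, batch_size):
    """Sum of per-step times when each step waits for its slowest worker."""
    # Cut the input stream into batches, then hand num_workers batches to each step.
    batches = [lengths[i:i + batch_size]
               for i in range(0, len(lengths), batch_size)]
    steps = [batches[i:i + num_workers]
             for i in range(0, len(batches), num_workers)]
    # A synchronous step finishes only when the most expensive batch is done.
    return sum(max(padded_cost(b) for b in step) for step in steps)


def uncoordinated_time(lengths, num_workers, batch_size):
    # Workers receive inputs in arrival order, so sizes within a step can differ widely.
    return total_training_time(list(lengths), num_workers, batch_size)


def coordinated_time(lengths, num_workers, batch_size):
    # Simplification: sort by length so all workers in a step see similar sizes.
    return total_training_time(sorted(lengths), num_workers, batch_size)


if __name__ == "__main__":
    lengths = [1, 9, 1, 9]  # two short and two long inputs
    print(uncoordinated_time(lengths, num_workers=2, batch_size=1))  # 18
    print(coordinated_time(lengths, num_workers=2, batch_size=1))    # 10
```

In this example, each uncoordinated step pairs a short input with a long one, so both steps cost 9. Grouping similar sizes gives one step of cost 1 and one of cost 9. The numbers only illustrate the mechanism and say nothing about the 2.2x figure reported in the abstract.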

nuts-flow/ml: data pre-processing for deep learning

Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines

A Spark ML driven preprocessing approach for deep learning based scholarly data applications

DeepPrep: An accelerated, scalable, and robust pipeline for neuroimaging preprocessing empowered by deep learning

Automated Image Data Preprocessing with Deep Reinforcement Learning

A Comprehensive Evaluation of Metabolomics Data Preprocessing Methods for Deep Learning

Efficient Tabular Data Preprocessing of ML Pipelines

Deep Fast Machine Learning Utils: A Python Library for Streamlined Machine Learning Prototyping

tf.data: A Machine Learning Data Processing Framework

DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data

Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline

MLOps: Automatic, Zero-Touch and Reusable Machine Learning Training and Serving Pipelines

The more, the better? Evaluating the role of EEG preprocessing for deep learning applications

A Data-Centric Optimization Framework for Machine Learning

Understanding Unconventional Preprocessors in Deep Convolutional Neural Networks for Face Identification

Data Pipeline Training: Integrating AutoML to Optimize the Data Flow of Machine Learning Models

EdnaML: A Declarative API and Framework for Reproducible Deep Learning

tf.data service: A Case for Disaggregating ML Input Data Processing

Analyzing and Mitigating Data Stalls in DNN Training

Deep Learning Pipeline for Preprocessing and Segmenting Cardiac Magnetic Resonance of Single Ventricle Patients from an Image Registry

STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison