Abstract: Machine learning algorithms are typically run on large-scale, distributed compute infrastructure that routinely faces unavailabilities such as failures and temporary slowdowns. Adding redundant computations using coding-theoretic tools called "codes" is an emerging technique to alleviate the adverse effects of such unavailabilities. A code consists of an encoding function that proactively introduces redundant computation and a decoding function that reconstructs unavailable outputs using the available ones. Past work focuses on using codes to provide resilience for linear computations and specific iterative optimization algorithms. However, computations performed for a variety of applications, including inference on state-of-the-art machine learning algorithms such as neural networks, typically fall outside this realm. In this paper, we propose taking a learning-based approach to designing codes that can handle non-linear computations. We present carefully designed neural network architectures and a training methodology for learning encoding and decoding functions that produce approximate reconstructions of unavailable computation results. We present extensive experimental results demonstrating the effectiveness of the proposed approach: we show that our learned codes can accurately reconstruct $64$–$98\%$ of the unavailable predictions from neural-network-based image classifiers on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. To the best of our knowledge, this work proposes the first learning-based approach for designing codes, and also presents the first coding-theoretic solution that can provide resilience for any non-linear (differentiable) computation. Our results show that learning can be an effective technique for designing codes, and that learned codes are a highly promising approach for bringing the benefits of coding to non-linear computations.
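To make the encode/decode idea concrete, the following is a minimal sketch of a simple hand-written addition/subtraction code, not the learned code proposed in the paper. All names (`encode`, `decode`, `linear_f`, `nonlinear_f`) and the toy inputs are illustrative assumptions. It shows why such a code recovers an unavailable output exactly when the computation is linear, and why the same code fails once the computation is non-linear, which is the gap the abstract describes.

```python
# Illustrative sketch only: a hand-written sum/subtract code, NOT the
# learned encoder/decoder from the paper. All names are hypothetical.

def encode(x1, x2):
    """Encoding: build a redundant 'parity' input as the elementwise sum."""
    return [a + b for a, b in zip(x1, x2)]

def decode(f_parity, f_available):
    """Decoding: subtract the available output from the parity output."""
    return [p - a for p, a in zip(f_parity, f_available)]

def linear_f(x):
    """A linear computation: scale every element by 3."""
    return [3 * v for v in x]

def nonlinear_f(x):
    """A non-linear computation: square every element."""
    return [v * v for v in x]

x1, x2 = [1, 2], [4, -1]
parity = encode(x1, x2)

# Suppose the worker computing f(x2) is unavailable; only f(x1) and
# f(parity) come back.

# Linear case: f(x1 + x2) - f(x1) equals f(x2), so decoding is exact.
recovered_linear = decode(linear_f(parity), linear_f(x1))
print(recovered_linear == linear_f(x2))

# Non-linear case: f(x1 + x2) - f(x1) generally differs from f(x2),
# so the same code no longer reconstructs the missing output.
recovered_nonlinear = decode(nonlinear_f(parity), nonlinear_f(x1))
print(recovered_nonlinear == nonlinear_f(x2))
```

The first comparison holds because a linear function distributes over the sum; the second fails because squaring does not. This is the motivation for learning the encoding and decoding functions rather than fixing them by hand.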

Non-Linear Coded Computation for Distributed CNN Inference: A Learning-based Approach

Coded Parallelism for Distributed Deep Learning

Efficient Partitioning and Communication Scheme-Based Distributed Edge Computing to Accelerate Deep Neural Network

Coded Computing for Resilient Distributed Computing: A Learning-Theoretic Framework

Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs

Coded Distributed Image Classification

Enhancing Distributed In-Situ CNN Inference in the Internet of Things

Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation

Failure-Resilient Distributed Inference with Model Compression over Heterogeneous Edge Devices

Neural Network Coding of Difference Updates for Efficient Distributed Learning Communication

Researching the CNN Collaborative Inference Mechanism for Heterogeneous Edge Devices

Collaborative edge computing for distributed CNN inference acceleration using receptive field-based segmentation

Compressed Coded Distributed Computing

Automated Exploration and Implementation of Distributed CNN Inference at the Edge

Network Coding Approaches for Distributed Computation over Lossy Wireless Networks

Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing

Attention-based Feature Compression for CNN Inference Offloading in Edge Computing

Adaptive Verifiable Coded Computing: Towards Fast, Secure and Private Distributed Machine Learning

DistrEdge: Speeding up Convolutional Neural Network Inference on Distributed Edge Devices

Cooperative Inference with Interleaved Operator Partitioning for CNNs