Abstract: Deploying Convolutional Neural Networks (CNNs) on resource-constrained devices necessitates efficient management of computational resources, often via distributed systems susceptible to latency from straggler nodes. This paper introduces the Flexible Coded Distributed Convolution Computing (FCDCC) framework to enhance fault tolerance and numerical stability in distributed CNNs. We extend Coded Distributed Computing (CDC) with Circulant and Rotation Matrix Embedding (CRME), which was originally proposed for matrix multiplication, to high-dimensional tensor convolution. For the proposed scheme, referred to as the Numerically Stable Coded Tensor Convolution (NSCTC) scheme, we also propose two new coded partitioning schemes: Adaptive-Padding Coded Partitioning (APCP) for the input tensor and Kernel-Channel Coded Partitioning (KCCP) for the filter tensor. These strategies enable linear decomposition of tensor convolutions and their encoding into CDC sub-tasks, combining model parallelism with coded redundancy for robust and efficient execution. Theoretical analysis identifies an optimal trade-off between communication and storage costs. Empirical results validate the framework's effectiveness in computational efficiency, fault tolerance, and scalability across various CNN architectures.

What problem does this paper attempt to address?

This paper addresses the main challenges of deploying convolutional neural networks (CNNs) in distributed systems, especially the management of computational resources on resource-constrained devices. Specifically, it focuses on the following aspects:

1. **Computational Efficiency and Fault Tolerance**: In a distributed system, hardware heterogeneity and unstable network conditions among computing nodes (especially edge devices) cause some nodes, referred to as "stragglers", to run slowly, which degrades the performance and reliability of the entire system. The paper proposes a new framework, Flexible Coded Distributed Convolution Computing (FCDCC), that aims to improve computational efficiency and fault tolerance and to reduce the latency caused by stragglers.
2. **Numerical Stability**: Traditional Coded Distributed Computing (CDC) methods are numerically unstable when applied to high-dimensional tensor convolutions. In deep-learning models in particular, the accumulated error grows significantly as network depth increases. By introducing the Circulant and Rotation Matrix Embedding (CRME) technique, the paper proposes a Numerically Stable Coded Tensor Convolution (NSCTC) scheme to solve this problem.
3. **Efficient Tensor Partitioning Strategies**: To manage high-dimensional tensor structures effectively and to balance numerical stability against communication and storage costs, the paper proposes two new tensor partitioning strategies:
   - **Adaptive-Padding Coded Partitioning (APCP)**: Used to partition input tensors, it reduces the communication cost and the workload of each node through adaptive padding.
   - **Kernel-Channel Coded Partitioning (KCCP)**: Used to partition filter tensors, it generates coded partitions by non-overlapping partitioning along the output-channel dimension, which enhances robustness and reduces the storage cost and the workload of each node.
4. **Framework Optimization**: The paper analyzes the FCDCC framework and determines the optimal partitioning parameters \(k_A\) and \(k_B\) that balance communication and storage costs while keeping the number of subtasks fixed.
5. **Generality**: The proposed framework is applicable to multiple CNN libraries (such as PyTorch) and to different CNN models (such as LeNet, AlexNet, and VGGNet).

In summary, this paper addresses the key problems of deploying CNNs efficiently and stably in a distributed environment. By introducing new coding techniques and optimization strategies, it improves the computational efficiency, fault tolerance, and numerical stability of the system.
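The idea behind coded partitioning of the filter tensor can be illustrated with a minimal sketch. It relies only on the fact that convolution is linear in the filter, so a linear combination of filter blocks yields the same linear combination of output blocks. The sketch is not the paper's NSCTC scheme: it substitutes a plain real Vandermonde code at Chebyshev points for CRME, it does not partition the input tensor, and the names `conv2d`, `encode_filters`, and `decode` are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' CNN convolution (cross-correlation).
    x: (C_in, H, W), w: (C_out, C_in, kh, kw) -> (C_out, H-kh+1, W-kw+1)."""
    c_out, _, kh, kw = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - kh + 1, wd - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            # Contract input channels and both kernel axes for every output channel.
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def encode_filters(w, k, n):
    """Split w into k non-overlapping output-channel blocks and form n coded blocks.
    Assumption: a Vandermonde generator stands in for the paper's CRME code."""
    parts = np.split(w, k, axis=0)
    nodes = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))  # distinct Chebyshev points
    G = np.vander(nodes, k, increasing=True)                  # shape (n, k)
    coded = [sum(G[i, j] * parts[j] for j in range(k)) for i in range(n)]
    return coded, G

def decode(results, worker_ids, G, k):
    """Recover the full output from the results of any k workers."""
    ids = list(worker_ids)[:k]
    stacked = np.stack([results[i] for i in ids])             # (k, C_out/k, H', W')
    # Solve G_sub @ blocks = stacked for the uncoded output blocks.
    blocks = np.linalg.solve(G[ids], stacked.reshape(k, -1)).reshape(stacked.shape)
    return np.concatenate(list(blocks), axis=0)               # (C_out, H', W')

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))        # input tensor
w = rng.standard_normal((4, 3, 3, 3))     # filter tensor with 4 output channels
k, n = 2, 4                               # 2 data blocks, 4 workers

coded, G = encode_filters(w, k, n)
results = {i: conv2d(x, coded[i]) for i in range(n)}  # one sub-task per worker
survivors = [1, 3]                        # pretend workers 0 and 2 are stragglers
y = decode(results, survivors, G, k)
max_err = np.max(np.abs(y - conv2d(x, w)))
print("max reconstruction error:", max_err)
```

Each worker stores and convolves a coded filter block of size 1/k of the original, and any k of the n results suffice to decode. Vandermonde generators become ill-conditioned as n and k grow, which is the kind of numerical instability the paper's CRME-based scheme is meant to avoid.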

Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs
