Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs

Shuo Tan, Rui Liu, XianLei Long, Kai Wan, Linqi Song, Yong Li
2024-11-03
Abstract: Deploying Convolutional Neural Networks (CNNs) on resource-constrained devices necessitates efficient management of computational resources, often via distributed systems susceptible to latency from straggler nodes. This paper introduces the Flexible Coded Distributed Convolution Computing (FCDCC) framework to enhance fault tolerance and numerical stability in distributed CNNs. We extend Coded Distributed Computing (CDC) with Circulant and Rotation Matrix Embedding (CRME), originally proposed for matrix multiplication, to high-dimensional tensor convolution. For the resulting scheme, referred to as the Numerically Stable Coded Tensor Convolution (NSCTC) scheme, we also propose two new coded partitioning schemes: Adaptive-Padding Coded Partitioning (APCP) for the input tensor and Kernel-Channel Coded Partitioning (KCCP) for the filter tensor. These strategies enable the linear decomposition of tensor convolutions and their encoding into CDC sub-tasks, combining model parallelism with coded redundancy for robust and efficient execution. Theoretical analysis identifies an optimal trade-off between communication and storage costs. Empirical results validate the framework's effectiveness in computational efficiency, fault tolerance, and scalability across various CNN architectures.
Distributed, Parallel, and Cluster Computing, Artificial Intelligence, Computer Vision and Pattern Recognition, Information Theory, Machine Learning
What problem does this paper attempt to address?
This paper addresses the main challenges of deploying convolutional neural networks (CNNs) in distributed systems, in particular the management of computational resources on resource-constrained devices. Specifically, it focuses on the following aspects:

1. **Computational Efficiency and Fault Tolerance**: In a distributed system, hardware heterogeneity and unstable network conditions among computing nodes (especially edge devices) cause some nodes, referred to as "stragglers", to run slowly, which degrades the performance and reliability of the entire system. The paper proposes a new framework, Flexible Coded Distributed Convolution Computing (FCDCC), to improve computational efficiency and fault tolerance and to reduce the latency caused by stragglers.
2. **Numerical Stability**: Traditional Coded Distributed Computing (CDC) methods are numerically unstable when applied to high-dimensional tensor convolutions. In deep-learning models in particular, the accumulated error grows significantly with network depth. By extending the Circulant and Rotation Matrix Embedding (CRME) technique to tensor convolution, the paper proposes a Numerically Stable Coded Tensor Convolution (NSCTC) scheme to solve this problem.
3. **Efficient Tensor Partitioning Strategies**: To manage high-dimensional tensor structures while balancing numerical stability against communication and storage costs, the paper proposes two new tensor partitioning strategies:
   - **Adaptive-Padding Coded Partitioning (APCP)**: Used to partition the input tensor; adaptive padding reduces the communication cost and the workload of each node.
   - **Kernel-Channel Coded Partitioning (KCCP)**: Used to partition the filter tensor; it generates coded partitions from non-overlapping splits along the output-channel dimension, which enhances robustness and reduces the storage cost and the workload of each node.
4. **Framework Optimization**: The paper analyzes the FCDCC framework and determines the optimal partitioning parameters \(k_A\) and \(k_B\) that balance communication and storage costs while keeping the number of subtasks fixed.
5. **Generality**: The proposed framework is applicable to multiple CNN libraries (such as PyTorch) and to different CNN models (such as LeNet, AlexNet, and VGGNet).

In summary, this paper addresses the key problems of deploying CNNs efficiently and stably in a distributed environment. By introducing new coding techniques and optimization strategies, it improves the computational efficiency, fault tolerance, and numerical stability of the system.
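
The linearity that these partitioning strategies rely on can be illustrated with a small NumPy sketch. It is not the paper's implementation. The helper names (`conv2d`, `split_input_rows`), the tensor sizes, and the choice of two input parts and two filter parts are all assumptions made for illustration. The input is split along its height with overlapping rows so that each part yields a disjoint slice of the output, and the filters are split along the output-channel dimension without overlap. The redundancy shown is a plain sum parity over the two filter parts, swapped in here for simplicity in place of the CRME-based encoding.

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' cross-correlation with stride 1.
    x: (C, H, W) input, k: (N, C, KH, KW) filters -> (N, H-KH+1, W-KW+1)."""
    n, c, kh, kw = k.shape
    _, h, w = x.shape
    out = np.zeros((n, h - kh + 1, w - kw + 1))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = x[:, i:i + kh, j:j + kw]
            # Contract channel and kernel axes; leaves one value per filter.
            out[:, i, j] = np.tensordot(k, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def split_input_rows(x, kh, parts):
    """Hypothetical helper: split x along height so that each part produces
    an equal, disjoint block of output rows (parts overlap by kh - 1 rows)."""
    h_out = x.shape[1] - kh + 1
    assert h_out % parts == 0, "sketch assumes an even split of output rows"
    step = h_out // parts
    return [x[:, p * step:(p + 1) * step + kh - 1, :] for p in range(parts)]

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 10, 8))    # C=3, H=10, W=8 (illustrative sizes)
k = rng.standard_normal((4, 3, 3, 3))  # N=4 filters of size 3x3
full = conv2d(x, k)                    # reference result, shape (4, 8, 6)

# Uncoded decomposition: 2 input parts x 2 filter parts = 4 subtasks.
x_parts = split_input_rows(x, kh=3, parts=2)
k_parts = np.split(k, 2, axis=0)       # non-overlapping, along output channels
blocks = [[conv2d(xp, kp) for xp in x_parts] for kp in k_parts]
rebuilt = np.concatenate(
    [np.concatenate(row, axis=1) for row in blocks], axis=0
)
assert np.allclose(full, rebuilt)

# Simple sum parity over the filter parts (an assumption, not CRME):
# if the worker holding k_parts[0] straggles, its result is still recoverable.
parity_out = conv2d(x, k_parts[0] + k_parts[1])
recovered = parity_out - conv2d(x, k_parts[1])
assert np.allclose(recovered, full[:2])
```

Because convolution is linear in the filter tensor, any linear combination of filter partitions yields the same combination of outputs, which is what lets a decoder reconstruct the full result from a subset of workers. The choice of encoding matrix determines how well conditioned that decoding step is, which is the numerical-stability concern discussed above.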