DNNSplit: Latency and Cost-Efficient Split Point Identification for Multi-Tier DNN Partitioning

Paridhika Kayal, Alberto Leon-Garcia
DOI: https://doi.org/10.1109/ACCESS.2024.3409057
IF: 3.9
IEEE Access
Abstract: Due to the high computational demands inherent in Deep Neural Network (DNN) executions, multi-tier environments have emerged as preferred platforms for DNN inference tasks. Previous research on partitioning strategies for DNN models typically involved leveraging all layers of the DNN to identify optimal splits aimed at reducing latency or cost. However, due to their computational complexity, these approaches face scalability issues, particularly with models containing hundreds of layers. The novelty of our work lies in identifying specific split points within various DNN models that consistently lead to efficient latency or cost partitioning. Under the assumption that per-unit computing cost decreases in higher tiers and that bandwidth is not free, we show that only these specific split points need to be considered to optimize latency or cost. Importantly, these split points are independent of infrastructure configurations and bandwidth variations. The key contribution of our work is the significant reduction in the computational complexity of DNN partitioning, making our strategy applicable to models with a large number of layers. We introduce DNNSplit, an adaptive strategy that enables dynamic split decisions under varying conditions with minimal complexity. Evaluated across nine DNN models varying in size and architecture, DNNSplit is highly effective in optimizing latency and cost. Even for a larger model containing 517 layers, it identifies only 5 points as potential split points, thereby reducing the partitioning complexity by more than 100x. This makes DNNSplit especially advantageous for managing larger models. DNNSplit also demonstrates significant improvements for multi-tier deployments compared to single-tier execution, including up to 15x latency speedup, 20x cost reduction, and 5x throughput enhancement.
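To make the idea of searching only a few candidate split points concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how DNNSplit identifies its split points, so the candidates are simply passed in as input. The two-tier latency model (lower-tier compute, plus transfer of the intermediate output, plus higher-tier compute) and all names (`Layer`, `split_latency`, `best_split`, `search_reduction`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Layer:
    # All fields are hypothetical profiling values, not taken from the paper.
    compute_ms_low: float   # time to run this layer on the lower tier
    compute_ms_high: float  # time to run this layer on the higher tier
    output_kb: float        # size of this layer's output tensor


def split_latency(layers: Sequence[Layer], split: int, bandwidth_kb_per_ms: float) -> float:
    """Latency when layers[:split] run on the lower tier and layers[split:]
    on the higher tier, for 1 <= split <= len(layers)."""
    low = sum(l.compute_ms_low for l in layers[:split])
    high = sum(l.compute_ms_high for l in layers[split:])
    # Nothing is transferred when every layer stays on the lower tier.
    transfer = 0.0
    if split < len(layers):
        transfer = layers[split - 1].output_kb / bandwidth_kb_per_ms
    return low + transfer + high


def best_split(layers: Sequence[Layer], candidates: Iterable[int],
               bandwidth_kb_per_ms: float) -> int:
    """Evaluate only the given candidate split points and return the fastest."""
    return min(candidates, key=lambda s: split_latency(layers, s, bandwidth_kb_per_ms))


def search_reduction(num_layers: int, num_candidates: int) -> float:
    """Ratio of layers to candidate split points, a rough proxy for the
    reduction in the number of splits that must be evaluated."""
    return num_layers / num_candidates


# Toy four-layer model with made-up numbers.
toy = [Layer(4, 1, 100), Layer(4, 1, 10), Layer(4, 1, 50), Layer(4, 1, 5)]
print(best_split(toy, [2, 4], bandwidth_kb_per_ms=10.0))  # 2
print(search_reduction(517, 5))  # 103.4, consistent with "more than 100x"
```

In this toy model, restricting the search to candidates 2 and 4 finds the same split as checking every layer, because layer 2 has a small output that is cheap to transfer. Because bandwidth is a parameter of `split_latency`, the chosen split can be recomputed cheaply whenever conditions change.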
Computer Science