Abstract: Many applications, such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNNs) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack hardware-level resource management mechanisms for avoiding resource contention. Therefore, we propose Miriam, a contention-aware task coordination framework for multi-DNN inference on edge GPUs. Miriam consolidates two main components, an elastic-kernel generator and a runtime dynamic kernel coordinator, to support mixed-criticality DNN inference. To evaluate Miriam, we build a new DNN inference benchmark based on CUDA with diverse representative DNN workloads. Experiments on two edge GPU platforms show that Miriam can increase system throughput by 92% while incurring less than 10% latency overhead for critical tasks, compared to state-of-the-art baselines.

What problem does this paper attempt to address?

The paper addresses how to manage resource contention when multiple deep neural network (DNN) tasks run simultaneously on edge GPUs, so that the real-time performance requirements of each task are met. Specifically, it focuses on coordinating multiple DNN tasks with different criticality levels on resource-constrained edge GPUs, ensuring that critical tasks are prioritized while maximizing overall system throughput. Traditional solutions either increase latency for critical tasks or sacrifice the throughput of non-critical tasks. The paper therefore proposes a new system called Miriam, which optimizes resource management and scheduling of multiple DNN inference tasks by introducing Elastic Kernels, improving overall system performance while preserving the real-time performance of critical tasks.

The main contributions of Miriam include:

1. **Elastic Kernel Generation**: Transforming traditional kernels into elastic kernels that can dynamically adjust resource usage patterns to meet different task requirements.
2. **Runtime Dynamic Kernel Coordination**: Dynamically selecting the optimal elastic kernel configuration at runtime based on current resource usage, ensuring that critical tasks are not disrupted while maximizing resource utilization for non-critical tasks.
3. **Performance Evaluation**: Validating the effectiveness of Miriam by constructing new DNN inference benchmarks. Experimental results show that, compared to existing methods, Miriam can significantly improve system throughput while maintaining low latency for critical tasks.

Through these techniques, Miriam addresses resource contention when concurrently executing multiple DNN tasks on edge GPUs, enhancing overall system performance and the real-time responsiveness of critical tasks.
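The first two contributions can be illustrated with a small sketch. It is not Miriam's actual algorithm or API: the `ElasticConfig` class, the `shard_work` and `pick_config` helpers, and the "largest shape that fits" rule are all hypothetical, chosen only to show the general idea. A non-critical kernel is offered in several launch shapes, its work is split into shards sized to the chosen shape, and a coordinator picks a shape that fits the resources a critical kernel leaves free.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ElasticConfig:
    """One hypothetical launch shape for an elastic (non-critical) kernel."""
    num_blocks: int          # thread blocks launched per shard
    threads_per_block: int   # threads in each block

    @property
    def threads(self) -> int:
        return self.num_blocks * self.threads_per_block


def shard_work(total_items: int, cfg: ElasticConfig) -> List[range]:
    """Split total_items into shards, each covered by one launch of cfg."""
    step = cfg.threads
    return [range(start, min(start + step, total_items))
            for start in range(0, total_items, step)]


def pick_config(candidates: Sequence[ElasticConfig],
                free_blocks: int,
                free_threads: int) -> Optional[ElasticConfig]:
    """Return the largest candidate that fits the leftover resources.

    Assumption: 'fits' means staying within the block and thread budget
    that the critical kernel leaves unused. Returns None if nothing fits.
    """
    fitting = [c for c in candidates
               if c.num_blocks <= free_blocks and c.threads <= free_threads]
    return max(fitting, key=lambda c: c.threads, default=None)


# Hypothetical candidate shapes produced offline for one kernel.
candidates = [ElasticConfig(8, 256), ElasticConfig(4, 128), ElasticConfig(1, 64)]

# A critical kernel is running and leaves 4 blocks / 1024 threads free.
chosen = pick_config(candidates, free_blocks=4, free_threads=1024)
shards = shard_work(1000, chosen)  # 1000 work items split for the chosen shape
```

With these made-up numbers, the 8-block shape is rejected because it exceeds the free block budget, so the 4x128 shape is chosen and the 1000 items are split into two shards. Picking the largest fitting shape is only one possible policy; a real coordinator would also need to account for shared memory, registers, and the timing of the critical task.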

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

EdgeSP: Scalable Multi-device Parallel DNN Inference on Heterogeneous Edge Clusters

CoEdge: Cooperative DNN Inference With Adaptive Workload Partitioning Over Heterogeneous Edge Devices

Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing

Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy

Joint Architecture Design and Workload Partitioning for DNN Inference on Industrial IoT Clusters

G-NMP: Accelerating Graph Neural Networks with DIMM-based Near-Memory Processing

RT-mDL: Supporting Real-Time Mixed Deep Learning Tasks on Edge Platforms

GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing

EdgeCI: Distributed Workload Assignment and Model Partitioning for CNN Inference on Edge Clusters

MoEI: Mobility-Aware Edge Inference Based on Model Partition and Service Migration

Joint DNN partitioning and task offloading in mobile edge computing via deep reinforcement learning

Multi-user Co-inference with Batch Processing Capable Edge Server

An Online Approach for DNN Model Caching and Processor Allocation in Edge Computing

Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices

Task Partitioning and Offloading in DNN-Task Enabled Mobile Edge Computing Networks