Abstract: Cascade systems, consisting of a lightweight model processing all samples and a heavier, high-accuracy model refining challenging samples, have become a widely-adopted distributed inference approach to achieving high accuracy and maintaining a low computational burden for mobile and IoT devices. As intelligent indoor environments, like smart homes, continue to expand, a new scenario emerges, the multi-device cascade. In this setting, multiple diverse devices simultaneously utilize a shared heavy model hosted on a server, often situated within or close to the consumer environment. This work introduces MultiTASC++, a continuously adaptive multi-tenancy-aware scheduler that dynamically controls the forwarding decision functions of devices to optimize system throughput while maintaining high accuracy and low latency. Through extensive experimentation in diverse device environments and with varying server-side models, we demonstrate the scheduler's efficacy in consistently maintaining a targeted satisfaction rate while providing the highest available accuracy across different device tiers and workloads of up to 100 devices. This demonstrates its scalability and efficiency in addressing the unique challenges of collaborative DNN inference in dynamic and diverse IoT environments.

What problem does this paper attempt to address?

This paper addresses the challenges that arise when multiple intelligent devices share an edge server for deep learning (DL) inference in a multi-device cascade architecture. As intelligent indoor environments (such as smart homes) expand, new scenarios emerge in which multiple devices use a shared heavy model simultaneously. In this setting, the system must scale while balancing fast response times against high accuracy. It must avoid both the server overload that traditional methods cause and the loss of the accuracy advantage that comes from relying entirely on local execution.

### Core Problems of the Paper

1. **Concurrent Access by Multiple Devices**: When multiple devices simultaneously send inference requests for the heavy model to the edge server, how can the system remain efficient and responsive?
2. **Resource Allocation and Model Selection**: In a multi-device environment, how can the forwarding decision function of each device be adjusted dynamically to optimize system throughput while maintaining high accuracy and low latency?
3. **Dynamic Adaptability**: Given constantly changing workloads and device requirements, how can scheduling adapt continuously across different device tiers and workload conditions?

### Solutions

To solve these problems, the paper proposes **MultiTASC++**, a continuously adaptive multi-tenancy-aware scheduler aimed at optimizing the inference request arrival rate in a multi-device cascade architecture. Its main contributions include:

- **System Model**: Extends the cascade architecture to the multi-device setting and exposes its tunable parameters, enabling system designers to study the trade-offs systematically.
- **New Scheduler**: Introduces a new multi-tenancy-aware scheduler. By reconfiguring the forwarding decision function at a finer granularity and taking the latency requirements of each device into account, it achieves more effective per-device adaptation.
- **Continuous Adjustment**: Adjusts the forwarding decision function continuously rather than in discrete steps, which improves adaptability.
- **Server Model Switching**: Allows the server-side model to be switched dynamically according to different latency-accuracy trade-offs, increasing the flexibility of the architecture.

### Key Formulas

- **BvSB Metric**:
  \[ \text{BvSB}(f(x)) = P_1 - P_2 \]
  where \(P_1\) and \(P_2\) are the highest and second-highest values in the softmax output of the model, respectively.
- **Threshold Update Rule**:
  \[ \Delta \text{thresh} = -a \cdot (SR_{\text{target}} - SR_{\text{update}}) \]
  where \(\Delta \text{thresh}\) is the threshold adjustment, \(SR_{\text{target}}\) is the target SLO satisfaction rate, \(SR_{\text{update}}\) is the SLO satisfaction rate reported by the device, and \(a\) is a scaling factor.

Through these improvements, MultiTASC++ copes better with the challenges that dynamic and diverse Internet of Things environments pose for multi-device cascade architectures, providing higher scalability and efficiency.
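The two formulas above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names (`softmax`, `bvsb`, `should_forward`, `update_threshold`) are hypothetical. The rule "forward a sample when its BvSB margin falls below the threshold" and the clamping of the threshold to [0, 1] are assumptions made here for illustration; only the BvSB definition and the update formula come from the summary.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bvsb(probs: Sequence[float]) -> float:
    """Best-versus-Second-Best margin: P1 - P2 over the softmax output."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def should_forward(probs: Sequence[float], thresh: float) -> bool:
    """Assumed decision rule: a small margin means low confidence,
    so the sample is forwarded to the heavy server-side model."""
    return bvsb(probs) < thresh


def update_threshold(thresh: float, sr_target: float, sr_update: float,
                     a: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Apply delta_thresh = -a * (SR_target - SR_update).

    Clamping the result to [lo, hi] is an assumption of this sketch,
    chosen because the BvSB margin of a softmax output lies in [0, 1].
    """
    delta = -a * (sr_target - sr_update)
    return min(hi, max(lo, thresh + delta))


if __name__ == "__main__":
    probs = softmax([2.0, 1.8, 0.1])   # two classes nearly tied
    print("margin:", bvsb(probs))
    print("forward:", should_forward(probs, thresh=0.2))
    # The device reports a satisfaction rate below target, so the
    # threshold is lowered.
    print("new threshold:", update_threshold(0.5, 0.95, 0.85, a=0.5))
```

Under the assumed decision rule, a lower threshold forwards fewer samples. A reported satisfaction rate below the target therefore reduces the load on the shared server, and a rate above the target lets the device forward more samples again.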

MultiTASC++: A Continuously Adaptive Scheduler for Edge-Based Multi-Device Cascade Inference

MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at the Consumer Edge

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Joint Device Scheduling and Resource Allocation for ISCC-Based Multi-View-Multi-Task Inference

Octopus: SLO-Aware Progressive Inference Serving via Deep Reinforcement Learning in Multi-tenant Edge Cluster

Task-Oriented Sensing, Computation, and Communication Integration for Multi-Device Edge AI

T2C: A Multi-User System for Deploying DNNs in a Thing-to-Cloud Continuum

Edge-device Collaborative Computing for Multi-view Classification

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

CascadeServe: Unlocking Model Cascades for Inference Serving

DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

Collaborative Inference for Deep Neural Networks in Edge Environments

Collaborative Inference Acceleration Integrating DNN Partitioning and Task Offloading in Mobile Edge Computing

Automated Exploration and Implementation of Distributed CNN Inference at the Edge

BCEdge: SLO-Aware DNN Inference Services With Adaptive Batch-Concurrent Scheduling on Edge Devices

Reaching for the Sky: Maximizing Deep Learning Inference Throughput on Edge Devices with AI Multi-Tenancy

Accelerate Intermittent Deep Inference

Cascade: A Platform for Delay-Sensitive Edge Intelligence

COS: Cross-Processor Operator Scheduling for Multi-Tenant Deep Learning Inference

Multi-Model Running Latency Optimization in an Edge Computing Paradigm