Abstract: Recent years have witnessed significant achievements in deep learning (DL) technologies. Meanwhile, a growing number of online service operators use deep learning to provide intelligent and personalized services. Although significant effort has gone into optimizing inference efficiency, our investigation shows that for many DL models that process data-intensive requests, the network I/O subsystem also plays an essential role in determining responsiveness. Furthermore, under a latency constraint, uncontrolled network flow processing interferes with request batching. Based on these observations, this paper proposes CoFB, an inference service system that optimizes performance holistically. CoFB mitigates load imbalance in the network I/O subsystem with a lightweight flow scheduling scheme that coordinates the network interface card with a dispatcher thread. In addition, CoFB introduces a request reordering and batching policy and an interference-aware concurrent batch throttling strategy to keep inference within its deadline. We evaluate CoFB on four DL inference services and compare it to two state-of-the-art inference systems: NVIDIA Triton and DVABatch. Experimental results show that CoFB outperforms these two baselines, serving up to 2.69× and 1.96× higher load, respectively, under preset tail latency objectives.

iBalancer: Load-Aware in-Server Flow Scheduling for Sub-Millisecond Tail Latency

Libra: A Stateful Layer-4 Load Balancer with Fair Load Distribution

In-network Congestion-aware Load Balancing at Transport Layer

Laconic: Streamlined Load Balancers for SmartNICs

Halflife: An Adaptive Flowlet-based Load Balancer with Fading Timeout in Data Center Networks

Load balancing for heterogeneous traffic in datacenter networks

KnapsackLB: Enabling Performance-Aware Layer-4 Load Balancing

BurstBalancer: Do Less, Better Balance for Large-scale Data Center Traffic

SWP: Microsecond Network SLOs Without Priorities

L2BM: Switch Buffer Management for Hybrid Traffic in Data Center Networks

SeqBalance: Congestion-Aware Load Balancing with no Reordering for RoCE

HF^2T: Host-Based Flowlet Fine-Tuning for RDMA Load Balancing

Optimizing Flow Completion Time Via Adaptive Buffer Management in Data Center Networks

SwiftQueue: Optimizing Low-Latency Applications with Swift Packet Queuing

CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system

DALB: A Dynamic Application-Sensitive Load Balancing Algorithm

PostMan: Rapidly Mitigating Bursty Traffic via On-Demand Offloading of Packet Processing

Active and Adaptive Application-Level Flow Control for Latency Sensitive RPC Applications

Balancer: A Traffic-Aware Hybrid Rule Allocation Scheme in Software Defined Networks

Traffic-aware Buffer Management in Shared Memory Switches