Abstract: Recent years have witnessed significant achievements in deep learning (DL) technologies. Meanwhile, an increasing number of online service operators use deep learning to provide intelligent and personalized services. Although significant effort has gone into optimizing inference efficiency, our investigation shows that for many DL models that process data-intensive requests, the network I/O subsystem also plays an essential role in determining responsiveness. Furthermore, under a latency constraint, uncontrolled network flow processing affects request batching. Based on these observations, this paper proposes CoFB, an inference service system that optimizes performance holistically. CoFB mitigates load imbalance in the network I/O subsystem with a lightweight flow scheduling scheme in which the network interface card collaborates with a dispatcher thread. In addition, CoFB introduces a request reordering and batching policy and an interference-aware concurrent batch throttling strategy so that inference meets its deadline. We evaluate CoFB on four DL inference services and compare it with two state-of-the-art inference systems: NVIDIA Triton and DVABatch. Experimental results show that CoFB outperforms these two baselines, serving up to 2.69× and 1.96× higher load, respectively, under preset tail latency objectives.

BatOpt: Optimizing GPU-Based Deep Learning Inference Using Dynamic Batch Processing

MOC: Multi-Objective Mobile CPU-GPU Co-Optimization for Power-Efficient DNN Inference

DACO: Pursuing Ultra-low Power Consumption Via DNN-Adaptive CPU-GPU CO-optimization on Mobile Devices

SMDP-Based Dynamic Batching for Efficient Inference on GPU-Based Platforms

Multi-user Co-inference with Batch Processing Capable Edge Server

ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG

Accelerating End-to-End Deep Learning Workflow With Codesign of Data Preprocessing and Scheduling

CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system

Optimization of Edge Resources for Deep Learning Application with Batch and Model Management

DLBooster

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster

Profiling and optimizing deep learning inference on mobile GPUs

Automating Cloud Deployment for Deep Learning Inference of Real-time Online Services

BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching

CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices

DVFO: Learning-Based DVFS for Energy-Efficient Edge-Cloud Collaborative Inference

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

Pipeline-based Optimization Method for Large-Scale End-to-End Inference