Abstract: Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct responsibilities to handle serving requests, i.e., general-purpose CPUs for input preprocessing and domain-specific GPUs for forward computation. Recurrent neural networks play an essential role in handling temporal inputs and display distinctive computation characteristics because of their high inter-operator parallelism. Hence, we propose Chrion to optimize recurrent neural network inference by collaboratively utilizing CPUs and GPUs. We formulate the model deployment in the CPU-GPU cluster as an NP-hard scheduling problem of directed acyclic graphs on heterogeneous devices. Given an input model in the ONNX format and a user-defined SLO requirement, Chrion first preprocesses the model by parsing and profiling it, and then partitions the graph to select an execution device for each operator. When an online request arrives, Chrion performs forward computation according to the graph partition by executing the operators on the CPU and GPU in parallel. Our experimental results show that, compared with execution on the GPU alone, the execution time can be reduced by up to 19.4% in the latency-optimal pattern and the GPU memory footprint by 67.5% in the memory-optimal pattern.
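To make the idea of inter-operator parallelism across devices concrete, the following is a minimal sketch, not Chrion's implementation. It estimates the finish time of a small operator DAG under a given device placement. The toy graph, the latency numbers, the fixed cross-device transfer penalty, and the assumption that each device runs one operator at a time are all hypothetical simplifications introduced for illustration.

```python
from collections import defaultdict

# Hypothetical toy graph shaped like a bidirectional RNN layer:
# "embed" feeds two independent recurrent branches that are then concatenated.
OPS = ["embed", "fwd_rnn", "bwd_rnn", "concat"]  # listed in topological order
DEPS = {
    "fwd_rnn": ["embed"],
    "bwd_rnn": ["embed"],
    "concat": ["fwd_rnn", "bwd_rnn"],
}
# Made-up per-operator latencies in ms (not measurements from the paper).
COST = {
    ("embed", "gpu"): 1.0, ("embed", "cpu"): 2.0,
    ("fwd_rnn", "gpu"): 4.0, ("fwd_rnn", "cpu"): 5.0,
    ("bwd_rnn", "gpu"): 4.0, ("bwd_rnn", "cpu"): 5.0,
    ("concat", "gpu"): 1.0, ("concat", "cpu"): 1.0,
}


def makespan(ops, deps, cost, placement, transfer=0.0):
    """Finish time of the DAG, assuming each device runs one operator at a time.

    ops: operator names in topological order
    deps: dict mapping an operator to its predecessors
    cost: dict mapping (operator, device) to execution time in ms
    placement: dict mapping an operator to "cpu" or "gpu"
    transfer: fixed penalty in ms when an edge crosses devices (an assumption)
    """
    device_free = defaultdict(float)  # time at which each device becomes idle
    finish = {}
    for op in ops:
        dev = placement[op]
        ready = 0.0
        for pred in deps.get(op, []):
            hop = transfer if placement[pred] != dev else 0.0
            ready = max(ready, finish[pred] + hop)
        start = max(ready, device_free[dev])
        finish[op] = start + cost[(op, dev)]
        device_free[dev] = finish[op]
    return max(finish.values())


ALL_GPU = {op: "gpu" for op in OPS}
SPLIT = dict(ALL_GPU, bwd_rnn="cpu")  # move one branch to the CPU

if __name__ == "__main__":
    print(makespan(OPS, DEPS, COST, ALL_GPU, transfer=0.5))  # 10.0
    print(makespan(OPS, DEPS, COST, SPLIT, transfer=0.5))    # 8.0
```

In this toy setting the CPU is slower per operator, yet moving one of the two independent branches to it shortens the total time because both branches run concurrently. Real gains depend on measured operator costs and transfer overheads.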

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve two key problems encountered when deploying deep-learning models in cloud clusters:

1. **GPU memory bottleneck**
   - As deep-learning models grow more complex, GPU memory becomes a bottleneck for serving a large number of inference requests. The number of model parameters has increased from millions to billions, so the memory required to load an entire model has grown from the MB level to the GB level. The temporary tensors generated when calling the CUDA API further increase the demand for GPU memory. GPU memory capacity therefore becomes a key factor affecting inference performance.
   - Existing solutions mainly focus on optimizing algorithms and resource allocation, but overlook using the CPU to relieve pressure on the GPU.
2. **Low CPU resource utilization**
   - In deep-learning clusters, CPU utilization is usually low: it is reported to be below 30% in most commercial clusters and below 40% in others. This low utilization wastes CPU resources.
   - Commercial inference servers are usually configured with multiple CPU cores, but most of these cores sit idle while the GPU performs forward computation.

### Solutions

To solve the above problems, the paper proposes an inference framework named **Chrion**, designed specifically for recurrent neural network (RNN) models, which optimizes forward computation by collaboratively utilizing the CPU and GPU. The main contributions of Chrion are as follows:

1. **Identify and utilize the potential for parallel execution in the CPU-GPU environment**
   - Through a fine-grained graph partitioning algorithm, the operators of the RNN model are scheduled onto heterogeneous platforms (CPU and GPU) to achieve parallel execution.
   - The paper describes this as the first research to support inter-operator parallelism across heterogeneous platforms.
2. **Design the Chrion framework**
   - Chrion addresses the GPU memory bottleneck and improves CPU utilization by collaboratively utilizing the CPU and GPU.
   - The framework accepts a pre-trained RNN model (in ONNX format) and a user-defined service-level objective (SLO) as inputs, and generates an execution plan through model parsing, profiling, and graph partitioning.
3. **Adaptive graph partitioning algorithm**
   - An adaptive graph partitioning algorithm selects the execution platform for each operator. The algorithm is not model-specific, so it can be extended to more complex multi-branch models.
4. **Experimental verification**
   - Chrion's performance has been verified through extensive experiments. The results show that, compared with GPU-only execution, the execution time of the model can be reduced by up to 19.4% in the latency-optimal mode, and the GPU memory footprint can be reduced by up to 67.5% in the memory-optimal mode.

### Summary

By proposing the Chrion framework, this paper addresses the GPU memory bottleneck and low CPU resource utilization in deep-learning clusters, especially when serving RNN models. By collaboratively utilizing the CPU and GPU, Chrion improves system throughput while reducing inference latency and GPU memory footprint.
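The two operating modes described above (latency-optimal and memory-optimal under an SLO) can be illustrated with a small sketch. This is not the paper's adaptive partitioning algorithm: because the underlying scheduling problem is NP-hard, the sketch simply enumerates every placement of a four-operator toy graph, which is only feasible at this size. The profile numbers, the `gpu_mem_mb` field, the fixed transfer penalty, and the function names are hypothetical.

```python
from itertools import product

DEVICES = ("cpu", "gpu")

# Hypothetical profiling results: latency in ms per device, GPU memory in MB.
OPS = ["embed", "fwd_rnn", "bwd_rnn", "concat"]  # topological order
DEPS = {"fwd_rnn": ["embed"], "bwd_rnn": ["embed"],
        "concat": ["fwd_rnn", "bwd_rnn"]}
PROFILE = {
    "embed":   {"cpu": 2.0, "gpu": 1.0, "gpu_mem_mb": 100},
    "fwd_rnn": {"cpu": 5.0, "gpu": 4.0, "gpu_mem_mb": 200},
    "bwd_rnn": {"cpu": 5.0, "gpu": 4.0, "gpu_mem_mb": 200},
    "concat":  {"cpu": 1.0, "gpu": 1.0, "gpu_mem_mb": 10},
}


def latency(ops, deps, profile, placement, transfer_ms):
    """Estimated end-to-end time, assuming one operator at a time per device."""
    free = {d: 0.0 for d in DEVICES}
    finish = {}
    for op in ops:
        dev = placement[op]
        ready = max(
            (finish[p] + (transfer_ms if placement[p] != dev else 0.0)
             for p in deps.get(op, ())),
            default=0.0,
        )
        start = max(ready, free[dev])
        finish[op] = start + profile[op][dev]
        free[dev] = finish[op]
    return max(finish.values())


def gpu_memory(ops, profile, placement):
    """Sum of GPU memory for operators placed on the GPU."""
    return sum(profile[op]["gpu_mem_mb"] for op in ops if placement[op] == "gpu")


def choose_partition(ops, deps, profile, transfer_ms, slo_ms, mode):
    """Exhaustively pick a placement that meets the SLO.

    mode="latency": minimize latency, break ties by GPU memory.
    mode="memory":  minimize GPU memory, break ties by latency.
    Returns (placement, latency_ms, gpu_mem_mb), or None if no placement meets the SLO.
    """
    best = None
    for choice in product(DEVICES, repeat=len(ops)):
        placement = dict(zip(ops, choice))
        lat = latency(ops, deps, profile, placement, transfer_ms)
        if lat > slo_ms:
            continue  # violates the latency SLO
        mem = gpu_memory(ops, profile, placement)
        key = (lat, mem) if mode == "latency" else (mem, lat)
        if best is None or key < best[0]:
            best = (key, placement, lat, mem)
    return None if best is None else best[1:]


if __name__ == "__main__":
    for mode in ("latency", "memory"):
        print(mode, choose_partition(OPS, DEPS, PROFILE, 0.5, 12.0, mode))
```

With these made-up numbers, the latency mode finds a 7.5 ms plan using 300 MB of GPU memory, and the memory mode finds a 200 MB plan at 8.0 ms, compared with 10.0 ms and 510 MB when everything runs on the GPU. A practical system would replace the exhaustive loop with a heuristic, since the number of placements doubles with every operator.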

Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs
