Abstract: With its high scalability and flexibility, serverless computing is becoming the most promising computing model. Existing serverless computing platforms launch a container for each function invocation, which wastes a large amount of computing resources. Our measurements reveal that (i) executing invocations concurrently within a single container can deliver performance comparable to that of the traditional approach of using multiple containers; and (ii) redundant resources created within a container waste memory, which prolongs the execution time of function invocations. Motivated by these observations, we propose FaaSBatch, a serverless framework that reduces invocation latency and saves scarce computing resources. In particular, FaaSBatch first classifies concurrent function requests into function groups according to their invocation information. Next, FaaSBatch batches the invocations of each group, aiming to minimize resource utilization. Then, FaaSBatch uses an inline parallel policy to map each group of batched invocations to a single container. Finally, FaaSBatch expands each batch and executes its invocations in parallel inside the container. To further reduce invocation latency and resource utilization, FaaSBatch reuses, within each container, the redundant resources created during function execution. We conduct extensive experiments based on Azure traces to evaluate the effectiveness and performance of FaaSBatch, comparing it with three state-of-the-art schedulers: Vanilla, SFS, and Kraken. The experimental results show that FaaSBatch substantially reduces both invocation latency and resource overhead. For instance, when executing I/O functions, FaaSBatch reduces invocation latency relative to Vanilla, SFS, and Kraken by up to 72.58%, 74.10%, and 72.62%, respectively; it also reduces resource overhead relative to the same three schedulers by 70.2% to 98.40%, 67.74% to 98.12%, and 43.01% to 78.90%, respectively.
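
The four steps named in the abstract (classify, batch, map to one container, execute in parallel) together with in-container resource reuse can be illustrated with a minimal Python sketch. This is not FaaSBatch's implementation: the names `Invocation`, `Container`, `group_invocations`, and `dispatch` are hypothetical, grouping by function name is an assumption about what "invocation information" means, and a thread pool stands in for in-container parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Invocation:
    function_name: str  # assumed grouping key; the paper may use richer information
    payload: Any


@dataclass
class Container:
    """Hypothetical stand-in for one container serving one function group."""
    handler: Callable[[Any, Any], Any]   # handler(shared_resource, payload)
    make_resource: Callable[[], Any]     # e.g. builds a storage client
    _resource: Any = None
    resources_created: int = 0

    def resource(self) -> Any:
        # Create the shared resource once and reuse it for later invocations.
        if self._resource is None:
            self._resource = self.make_resource()
            self.resources_created += 1
        return self._resource

    def run_batch(self, batch: List[Invocation]) -> List[Any]:
        shared = self.resource()  # created before the worker threads start
        # Expand the batch and run its invocations in parallel in this container.
        with ThreadPoolExecutor(max_workers=max(1, len(batch))) as pool:
            return list(pool.map(lambda inv: self.handler(shared, inv.payload), batch))


def group_invocations(invocations: List[Invocation]) -> Dict[str, List[Invocation]]:
    """Classify concurrent requests into function groups (one batch per group)."""
    groups: Dict[str, List[Invocation]] = defaultdict(list)
    for inv in invocations:
        groups[inv.function_name].append(inv)
    return dict(groups)


def dispatch(invocations: List[Invocation],
             containers: Dict[str, Container]) -> Dict[str, List[Any]]:
    """Map each batched group to a single container and collect its results."""
    return {name: containers[name].run_batch(batch)
            for name, batch in group_invocations(invocations).items()}
```

In this sketch, two concurrent invocations of the same function share one `Container` and one cached resource instead of each triggering its own container. Groups are dispatched one after another for simplicity; a real scheduler would presumably run the containers concurrently as well.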

FaaSBatch: Boosting Serverless Efficiency With In-Container Parallelism and Resource Multiplexing
