Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service

Baolin Li, Siddharth Samsi, Vijay Gadepally, Devesh Tiwari
DOI: https://doi.org/10.1145/3581784.3607034
2023-09-01
Abstract: This paper presents a solution to the challenge of mitigating carbon emissions from hosting large-scale machine learning (ML) inference services. ML inference is critical to modern technology products, but it is also a significant contributor to carbon footprint. We introduce Clover, a carbon-friendly ML inference service runtime system that balances performance, accuracy, and carbon emissions through mixed-quality models and GPU resource partitioning. Our experimental results demonstrate that Clover is effective in substantially reducing carbon emissions while maintaining high accuracy and meeting service level agreement (SLA) targets.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the carbon emissions generated when large-scale machine learning (ML) inference services run in data centers. It proposes Clover, a carbon-friendly ML inference service runtime system that balances performance, accuracy, and carbon emissions through mixed-quality models and GPU resource partitioning. The main goal is to significantly reduce carbon emissions while maintaining high accuracy and meeting service level agreement (SLA) targets.

### Background and Motivation

1. **Carbon Emission Problems**:
   - Computing services in data centers account for 2% of global carbon emissions, and this share may grow as data volumes increase.
   - Large technology companies are increasingly integrating artificial intelligence (AI) into their products and hosting trained ML models on GPUs in data centers to provide ML inference services. These services exacerbate the carbon emission problem because they consume a large share of data-center compute cycles.
   - For example, many of Google's billion-user services rely on AI, and its inference services account for 60% of AI infrastructure emissions; Meta has expanded its infrastructure capacity by 2.5 times to meet ML inference demand; AWS and NVIDIA estimate that inference accounts for 90% of ML workloads in high-performance computing (HPC) and cloud data centers.
2. **Trade-offs between Carbon-friendliness, Performance, and Accuracy**:
   - Carbon-friendliness usually conflicts with both performance (low inference latency) and inference accuracy: high accuracy requires complex models and more computationally intensive operations, which increase the carbon footprint.
   - There is currently a lack of effective tools that automatically navigate this multi-dimensional trade-off space to make ML inference services carbon-friendly.
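The conflict between accuracy and carbon described above can be made concrete with the standard operational-carbon relation: emissions equal the energy consumed multiplied by the carbon intensity of the electricity supply. The following is a minimal sketch under stated assumptions. The variant names, power draws, latencies, and accuracies are hypothetical illustration values, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVariant:
    name: str         # hypothetical variant label
    accuracy: float   # hypothetical accuracy in [0, 1]
    latency_s: float  # hypothetical GPU time per request, in seconds
    power_w: float    # hypothetical average GPU power draw, in watts


def carbon_per_request_g(variant: ModelVariant, intensity_g_per_kwh: float) -> float:
    """Operational carbon (gCO2) = energy (kWh) * carbon intensity (gCO2/kWh)."""
    # watts * seconds = joules; 1 kWh = 3.6e6 joules
    energy_kwh = variant.power_w * variant.latency_s / 3_600_000
    return energy_kwh * intensity_g_per_kwh


# Two made-up variants of the same model family.
variants = [
    ModelVariant("small", accuracy=0.77, latency_s=0.010, power_w=150.0),
    ModelVariant("large", accuracy=0.84, latency_s=0.040, power_w=250.0),
]

for v in variants:
    grams = carbon_per_request_g(v, intensity_g_per_kwh=400.0)
    print(f"{v.name}: accuracy={v.accuracy:.2f}, carbon={grams * 1000:.4f} mgCO2/request")
```

In this toy setup the more accurate variant uses more energy per request, so it emits more carbon at any fixed carbon intensity. That is the trade-off a carbon-aware serving system has to navigate.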
### Key Ideas and Contributions of Clover

1. **Opportunities for Mixed-quality Models and GPU Partitioning**:
   - For the first time, the paper experimentally demonstrates the opportunities and trade-offs of mixed-quality models and GPU partitioning for carbon emission reduction.
   - Experiments show that serving a mixture of models of different quality can significantly reduce carbon emissions while maintaining high accuracy.
   - GPU partitioning can reduce carbon emissions by improving resource utilization, but it may increase latency and cause SLA violations.
2. **A Novel Carbon-aware ML Inference Framework**:
   - Clover is a new carbon-footprint-aware ML inference service that reduces carbon emissions, achieves high accuracy, and meets SLA targets.
   - The framework combines two seemingly unrelated concepts, mixed-quality models and hardware-supported GPU partitioning, to minimize the carbon footprint.
   - Clover's intelligent GPU partitioning improves resource utilization and creates opportunities for carbon savings, but it can put the SLA at risk. Clover addresses this challenge with model variants of different quality: low-quality models let it minimize SLA violations, while high-quality models let it achieve high accuracy.
   - Clover's optimization engine dynamically adapts to changes in the carbon intensity of the data center's energy sources to opportunistically achieve carbon and accuracy targets while meeting SLA targets. The optimization runs entirely online and requires no offline training data, which makes it practical.

### Evaluation Results

- A real-system evaluation shows that Clover is very effective in reducing carbon emissions during model inference while still achieving high accuracy and meeting SLA constraints.
- The evaluation is carried out in a representative production environment, using real-world carbon-intensity traces and ML models covering natural language processing (BERT), object detection, and image classification applications.
- The real-system prototype shows that Clover performs close to an Oracle scheme that is infeasible in practice.

### Summary

The paper proposes Clover, a carbon-friendly ML inference service runtime system. Through mixed-quality models and GPU resource partitioning, it reduces carbon emissions while maintaining high accuracy and meeting SLA targets. The system is intended as a catalyst for the community to enhance and develop carbon-aware ML inference services.
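The interplay of mixed-quality models, GPU partitioning, and carbon intensity summarized above can be illustrated with a small sketch. It is not Clover's optimization engine, whose algorithm this summary does not detail. It is a brute-force selection over a handful of hypothetical configurations. All names (`Config`, `pick_config`), the numbers, and the carbon budget are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    label: str               # hypothetical variant mix and GPU partitioning
    accuracy: float          # hypothetical request-weighted accuracy of the mix
    p95_latency_ms: float    # hypothetical tail latency under load
    energy_j_per_req: float  # hypothetical energy per request, in joules


def carbon_g(cfg: Config, intensity_g_per_kwh: float) -> float:
    """gCO2 per request = energy (kWh) * carbon intensity (gCO2/kWh)."""
    return cfg.energy_j_per_req / 3_600_000 * intensity_g_per_kwh


def pick_config(
    configs: Sequence[Config],
    intensity_g_per_kwh: float,
    sla_ms: float,
    carbon_budget_g: float,
) -> Optional[Config]:
    """Pick the most accurate config that meets the SLA and the carbon budget.

    If no SLA-compliant config fits the budget, fall back to the
    lowest-carbon SLA-compliant one. Return None if nothing meets the SLA.
    """
    within_sla = [c for c in configs if c.p95_latency_ms <= sla_ms]
    if not within_sla:
        return None
    within_budget = [
        c for c in within_sla if carbon_g(c, intensity_g_per_kwh) <= carbon_budget_g
    ]
    if within_budget:
        return max(within_budget, key=lambda c: c.accuracy)
    return min(within_sla, key=lambda c: carbon_g(c, intensity_g_per_kwh))


# Made-up candidates: finer partitioning saves energy but raises latency.
CONFIGS = [
    Config("large model, whole GPU", 0.84, 40.0, 10.0),
    Config("large+small mix, 2 partitions", 0.81, 55.0, 6.0),
    Config("small model, 4 partitions", 0.77, 70.0, 3.0),
    Config("large model, smallest partitions", 0.84, 180.0, 5.0),  # misses the SLA
]

for intensity in (100.0, 300.0, 600.0):  # gCO2/kWh, illustrative values
    chosen = pick_config(CONFIGS, intensity, sla_ms=100.0, carbon_budget_g=6e-4)
    print(intensity, chosen.label if chosen else "no feasible config")
```

With these illustrative numbers, the selection moves from the large model to the mixed configuration and then to the small model as carbon intensity rises. The heavily partitioned large model is never chosen because it misses the latency target. This mirrors, in a much simplified form, the idea that lower-quality variants absorb the latency cost of partitioning while higher-quality variants preserve accuracy when the energy supply is cleaner.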