Coca4ai: checking energy behaviors on AI data centers

Paul Gay, Éric Bilinski, Anne-Laure Ligozat
2024-07-22
Abstract: Monitoring energy behaviors in AI data centers is crucial, both to reduce their energy consumption and to raise awareness among their users, who are key actors in the AI field. This paper presents a proof of concept for easy, lightweight monitoring of energy behaviors at the scale of a whole data center, a user, or a job submission. Our system uses software wattmeters, and we validate our setup with accurate per-node external wattmeters. Results show that there is notable potential for efficiency gains, providing arguments for building user engagement through energy monitoring.
Computers and Society
What problem does this paper attempt to address?
The paper addresses the following problem: **how to monitor and optimize energy-use behaviors in AI data centers effectively, in order to reduce energy consumption and raise users' environmental awareness**. Specifically, it focuses on the following aspects:

1. **Reducing energy consumption**: Monitoring energy use in AI data centers, identifying inefficient job configurations and under-utilized GPU resources, and proposing improvement measures to reduce energy consumption.
2. **Raising user awareness**: Providing detailed energy-use data to help users understand the environmental impact of their work and to encourage them to adopt more efficient computing strategies.
3. **Verifying the accuracy of the monitoring system**: Combining software power meters (such as RAPL and nvidia-smi) with an accurate external wattmeter to check that the recorded energy data is reliable.
4. **Promoting best practices**: Analyzing job status and resource utilization to identify areas that can be optimized, and giving users improvement suggestions such as adjusting batch size or optimizing pre-processing steps.

### Research Background

As artificial intelligence (AI) technology has spread, its environmental footprint has attracted increasing attention. Research shows that AI model training and inference consume a large amount of electricity, which has a significant impact on the environment. To address this challenge, researchers have proposed various methods to evaluate and reduce environmental impact in the ICT (information and communication technology) field, including open-source software power meters that measure energy consumption at program runtime, life-cycle analysis (LCA), and indirect-effect assessment.
However, most current research on energy behaviors in AI data centers is limited to simulation and lacks actual deployment and verification. This paper therefore aims to fill this gap by demonstrating a lightweight, easy-to-deploy energy-monitoring system that can monitor energy behaviors across the entire data center and provide users with specific optimization suggestions.

### Methods and Results

The researchers deployed a monitoring system on the SLURM cluster in the labia1 data center to record the CPU and GPU usage and power consumption of each job. They verified the accuracy of the system by comparing the data from the software power meters with the external wattmeter. The results show that approximately 60% of the power consumption comes from unfinished jobs (failed, cancelled, or timed-out jobs), while only 40% is used by completed jobs. In addition, most GPUs do not reach full-load operation, indicating that there is room for further optimization.

### Conclusions

This research shows how to quickly deploy an effective energy-monitoring system using open-source tools and an external wattmeter, reveals inefficiencies in current AI data centers, and gives users a simple way to optimize their job configurations and resource usage, thereby reducing energy waste and improving work efficiency. This contributes to environmental protection, encourages users to pay more attention to energy efficiency, and is a step towards sustainable development goals.
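
The two analyses described under Methods and Results (splitting energy by job outcome, and comparing software readings against an external wattmeter) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The record layout, the function names `energy_share_by_outcome` and `mean_relative_gap`, and all numbers are hypothetical, chosen only to show the arithmetic. The state names follow standard SLURM job states.

```python
from collections import defaultdict

# SLURM job states counted as "completed"; every other state
# (FAILED, CANCELLED, TIMEOUT, ...) is treated as unfinished.
COMPLETED_STATES = {"COMPLETED"}


def energy_share_by_outcome(jobs):
    """Return the fraction of total energy used by completed vs. unfinished jobs.

    `jobs` is an iterable of (state, energy_joules) pairs. This layout is an
    assumption made for illustration, not the format used in the paper.
    """
    totals = defaultdict(float)
    for state, energy_joules in jobs:
        key = "completed" if state in COMPLETED_STATES else "unfinished"
        totals[key] += energy_joules
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {"completed": 0.0, "unfinished": 0.0}
    return {k: totals[k] / grand_total for k in ("completed", "unfinished")}


def mean_relative_gap(software_watts, external_watts):
    """Mean relative difference between paired software and external readings.

    Assumes both sequences are time-aligned, equally long, non-empty, and that
    every external reading is positive.
    """
    pairs = list(zip(software_watts, external_watts))
    return sum(abs(s - e) / e for s, e in pairs) / len(pairs)


# Hypothetical job records, not data from the paper.
example_jobs = [
    ("COMPLETED", 500.0),
    ("FAILED", 200.0),
    ("CANCELLED", 200.0),
    ("TIMEOUT", 100.0),
]
shares = energy_share_by_outcome(example_jobs)

# Hypothetical paired power readings in watts for one node.
gap = mean_relative_gap([90.0, 180.0], [100.0, 200.0])
```

In a real deployment, the job states and per-job energy would come from the cluster's accounting and monitoring data rather than a hard-coded list, and the paired readings would need to be resampled onto a common time base before comparison.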