AI Hardware Resource Monitoring in the Data Center Environment

Nanduri Vijaya Saradhi
DOI: https://doi.org/10.55041/ijsrem36782
2024-07-24
INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Abstract: Deploying an AI (Artificial Intelligence) model in the data center places additional responsibilities on backend services such as monitoring. The performance of AI systems must be monitored regularly to ensure that they meet their requirements and do not run into system performance issues. This whitepaper covers the importance of monitoring AI systems, the monitoring model, how to measure the performance of system hardware resources such as CPU, memory, disk and GPU, and the tools used to monitor those resources. With monitoring in place, organisations can take proactive maintenance actions before performance bottlenecks in the AI systems cause an incident, which demonstrates the importance of monitoring. The goal of continuous monitoring is to ensure the effective operation of AI systems throughout their lifecycle and to meet several objectives: performance, anomaly detection, security monitoring, data compliance and continuous improvement. The performance of critical resources such as GPU, memory and storage is measured with suitable tools, and alerts are configured to fire when thresholds are reached on the identified resources. These measurements are then used to strengthen the AI system so that it remains stable against performance bottlenecks.
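The threshold-based alerting described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Alert` type, the `check_thresholds` function and the percentage values in `THRESHOLDS` are hypothetical names and numbers chosen for illustration, and the utilization readings are assumed to come from whatever monitoring tool is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    resource: str     # monitored resource, e.g. "gpu" or "memory"
    value: float      # observed utilization in percent
    threshold: float  # limit that was reached


# Hypothetical alert thresholds in percent; the paper does not specify values.
THRESHOLDS = {"cpu": 90.0, "memory": 85.0, "disk": 80.0, "gpu": 95.0}


def check_thresholds(readings, thresholds=THRESHOLDS):
    """Return an Alert for every reading at or above its configured threshold."""
    alerts = []
    for resource, value in readings.items():
        limit = thresholds.get(resource)
        # Resources without a configured threshold are ignored.
        if limit is not None and value >= limit:
            alerts.append(Alert(resource, value, limit))
    return alerts


if __name__ == "__main__":
    # Sample readings (assumed values, in percent) standing in for real metrics.
    sample = {"cpu": 42.0, "memory": 91.5, "disk": 63.0, "gpu": 97.2}
    for alert in check_thresholds(sample):
        print(f"ALERT {alert.resource}: {alert.value:.1f}% >= {alert.threshold:.1f}%")
```

Keeping the check as a pure function over a dictionary of readings separates the alert logic from metric collection, so the same rule can be applied to values gathered by any monitoring tool.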