CNN-based server state monitoring and fault diagnosis using infrared thermal images

Beltus Nkwawir Wiysobunri, Hamza Salih Erden, Behcet Ugur Toreyin
DOI: https://doi.org/10.1007/s00500-024-09792-y
IF: 3.732
2024-09-29
Soft Computing
Abstract: The recent spike in the demand for high-performance computing (HPC) server systems has created many challenges in data center (DC) facilities. These challenges include, but are not limited to, thermal management, sustaining system reliability, and minimizing server failures. In an attempt to address the last of these challenges, this paper proposes a deep convolutional neural network-based transfer learning approach for the automatic diagnosis of five server operation states: partial CPU load, maximum CPU load, main fan failure, CPU fan failure, and server entrance blockage. This transfer learning approach involves two main stages. The first stage consists of a deep neural network pretrained on the large ImageNet dataset that automatically extracts lower-level features. In the second stage, the higher layers of the pretrained deep neural networks are fine-tuned with limited labeled infrared images to classify each server operation state. A stratified five-fold cross-validation resampling method is employed to evaluate the effectiveness and generalization of the deep neural network architectures. The performance of the proposed method is evaluated and compared to a traditional support vector machine classifier trained on hand-crafted features. The automatic feature extraction and knowledge transfer capabilities of our approach are instrumental in attaining superior performance, with the DenseNet-201 architecture achieving the highest average validation accuracy of 99.60% across five dataset sizes. The experimental results not only indicate the effectiveness and robustness of deep neural networks trained with a small set of data, but also open up the possibility for DC operators to consider non-contact intelligent approaches to improving the thermal management, energy efficiency, and system reliability of servers in DCs using infrared thermal sensors and machine learning.
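The stratified five-fold cross-validation mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function name `stratified_k_fold`, the label strings in `STATES`, and the sample count of 20 images per class are assumptions chosen for illustration. The sketch only shows how each validation fold can be made to preserve the proportions of the five server operation states.

```python
import random
from collections import defaultdict

# Hypothetical label names for the five server operation states in the abstract.
STATES = [
    "partial_cpu_load",
    "max_cpu_load",
    "main_fan_failure",
    "cpu_fan_failure",
    "entrance_blockage",
]


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs, one per fold.

    Each class is shuffled and dealt round-robin across the k folds, so every
    fold holds roughly the same share of each class as the full dataset.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices across the folds.
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the validation set; the rest form the training set.
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


if __name__ == "__main__":
    # Assumed toy dataset: 20 thermal images per state (not the paper's dataset size).
    labels = [state for state in STATES for _ in range(20)]
    for fold_no, (train, val) in enumerate(stratified_k_fold(labels), start=1):
        per_class = {s: sum(1 for i in val if labels[i] == s) for s in STATES}
        print(f"fold {fold_no}: train={len(train)} val={len(val)} per-class={per_class}")
```

In a setup like the one the abstract describes, a model would be fine-tuned on each training split and scored on the matching validation split, with the five validation accuracies then averaged. In practice a library routine such as scikit-learn's `StratifiedKFold` would typically replace this hand-written splitter.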
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?